Conjugate priors and analytic posteriors

Part 7 — Bayesian methods

Learning objectives

Define a CONJUGATE prior as one for which the posterior belongs to the SAME parametric family
Derive the Gamma–Poisson posterior update for a rate parameter
Derive the Normal–Normal (known σ) posterior update as the inverse-variance-weighted combination of prior and data
Recognise the Normal–Inverse-Gamma joint conjugate for (μ, σ²)
State the Diaconis–Ylvisaker (1979) characterisation: every exponential-family likelihood admits a conjugate prior family whose hyperparameters update additively with the sufficient statistic

§7.1 met conjugacy through the Beta–Binomial example: the posterior had the same Beta form as the prior, with hyperparameters incremented by the data counts. That property — closure of the prior family under Bayesian updating — is one of the most useful structural facts in Bayesian inference. It exists for all the workhorse parametric models. §7.2 develops the broader theory and shows the three most useful conjugate pairs in detail.

What conjugacy means

A family $\mathcal{P}$ of prior distributions is CONJUGATE to a likelihood $p(D \mid \theta)$ if for every prior $p \in \mathcal{P}$ and every data set $D$, the posterior $p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta)$ also lies in $\mathcal{P}$. The posterior is a member of the SAME family, with updated hyperparameters. This is hugely convenient: finding the posterior requires no integration, just a closed-form update rule on the hyperparameters.

Conjugate priors exist for every EXPONENTIAL FAMILY likelihood. A likelihood belongs to the exponential family if it can be written as $h(D) \exp[\eta(\theta)^T T(D) - A(\theta)]$ for some natural parameter η, sufficient statistic T, and log-partition function A. The result is captured precisely by:

Theorem (Diaconis & Ylvisaker 1979): For an exponential-family likelihood with natural parameter η, the family of densities $p(\eta) \propto \exp[\lambda_1 \eta - \lambda_0 A(\eta)]$ indexed by hyperparameters $(\lambda_0, \lambda_1)$ is conjugate. The update rule is $(\lambda_0, \lambda_1) \to (\lambda_0 + 1, \lambda_1 + T(D))$ per observation.

This is THE structural reason conjugacy works wherever the likelihood is exponential family. In words, the Diaconis–Ylvisaker recipe is: find the sufficient statistic; the conjugate prior's log-density is linear in the natural parameter and the log-partition function, so each observation simply adds its sufficient statistic to the hyperparameters. Beta is conjugate to Bernoulli because the Bernoulli sufficient statistic is the success count. Gamma is conjugate to Poisson because the Poisson sufficient statistic is the total count, accumulated over a known exposure. Normal is conjugate to Normal-with-known-variance for the same structural reason. Everything follows from the exponential-family form.
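As a worked instance of the theorem, the Poisson likelihood can be put in exponential-family form and the resulting prior mapped back to the rate scale. This is a sketch rather than a full proof. To avoid a clash with the hyperparameters $(\lambda_0, \lambda_1)$, the Poisson rate is written $r$ here only (a notational assumption; the next section calls it λ).

```latex
% Poisson likelihood for one count k with rate r, in exponential-family form
p(k \mid r) = \frac{r^{k} e^{-r}}{k!}
            = \frac{1}{k!} \exp\left[ k \log r - r \right],
\qquad \eta = \log r, \quad T(k) = k, \quad A(\eta) = e^{\eta}.

% Diaconis--Ylvisaker prior on the natural parameter
p(\eta) \propto \exp\left[ \lambda_1 \eta - \lambda_0 e^{\eta} \right].

% Change variables to r = e^{\eta}, with Jacobian |d\eta / dr| = 1/r
p(r) \propto r^{\lambda_1} e^{-\lambda_0 r} \cdot \frac{1}{r}
     = r^{\lambda_1 - 1} e^{-\lambda_0 r},
\qquad \text{i.e. } r \sim \text{Gamma}(\alpha = \lambda_1, \, \beta = \lambda_0).
```

Read this way, the per-observation update $(\lambda_0, \lambda_1) \to (\lambda_0 + 1, \lambda_1 + k)$ says that the Gamma rate β gains 1 and the Gamma shape α gains the observed count k.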

Gamma–Poisson for rate parameters

You observe $N$ independent counts $k_1, \ldots, k_N$ from a Poisson distribution with rate $\lambda$ . The likelihood (with $S = \sum k_i$ the total count) is

p(S, N \mid \lambda) \propto \lambda^{S} e^{-N \lambda}.

A Gamma(α, β) prior on λ (rate parameterisation: density proportional to $\lambda^{\alpha - 1} e^{-\beta \lambda}$) gives a posterior

p(\lambda \mid D) \propto \lambda^{S + \alpha - 1} e^{-(N + \beta) \lambda} = \text{Gamma}(\alpha + S, \, \beta + N).

Update rule: α gains S (the total observed counts); β gains N (the total exposure). The posterior mean is

E[\lambda \mid D] = \frac{\alpha + S}{\beta + N}.

The story parallels §7.1: the posterior mean is a weighted average of the prior mean α/β and the MLE S/N, with weights β/(β + N) and N/(β + N). As N grows, the posterior mean → S/N = MLE; for small N the posterior is anchored to α/β = prior mean. Use case: estimating the rate of rare events (insurance claims, defects per batch, citations per paper). The kidney-cancer example in Gelman et al. (2013, Chapter 2) uses Gamma–Poisson on county-level death rates.
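The update rule is short enough to code directly. The following is a minimal sketch, not a library API: the function names and the example counts are illustrative assumptions, with the counts chosen only so that N = 5 and S = 12.

```python
from typing import Iterable, Tuple


def gamma_poisson_update(alpha: float, beta: float,
                         counts: Iterable[int]) -> Tuple[float, float]:
    """Posterior (alpha_n, beta_n) for a Gamma(alpha, beta) prior in the
    rate parameterisation, after observing independent Poisson counts."""
    counts = list(counts)
    total_count = sum(counts)   # S: added to the shape
    exposure = len(counts)      # N: added to the rate
    return alpha + total_count, beta + exposure


def gamma_mean(alpha: float, beta: float) -> float:
    """Mean of a Gamma distribution in the rate parameterisation."""
    return alpha / beta


if __name__ == "__main__":
    # Hypothetical data: five counts summing to 12.
    counts = [3, 2, 1, 4, 2]
    alpha_n, beta_n = gamma_poisson_update(2.0, 1.0, counts)
    print(alpha_n, beta_n, gamma_mean(alpha_n, beta_n))
```

With a Gamma(2, 1) prior this gives Gamma(14, 6), whose mean 14/6 sits between the prior mean 2 and the MLE 2.4.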

Normal–Normal with known σ

You observe N iid samples $y_1, \ldots, y_N \sim \mathcal{N}(\mu, \sigma^2)$ with σ KNOWN. Viewed as a function of μ, the likelihood is proportional to a Normal density with mean equal to the sample mean ȳ and standard deviation $\sigma/\sqrt{N}$. A Normal(μ₀, τ₀²) prior is conjugate: the posterior is

\mu \mid D \sim \mathcal{N}\left(\mu_n, \tau_n^2\right),

where the posterior PRECISION (inverse variance) is the sum of prior and data precisions:

\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{N}{\sigma^2},

and the posterior MEAN is the precision-weighted average:

\mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{N \bar{y}}{\sigma^2} \right).

The update rule has a clean interpretation: precisions add, and the posterior mean weights the prior mean and the sample mean by their precisions. As N grows, the data precision N/σ² dominates and the posterior tracks the sample mean. As τ₀ → ∞ (vague prior), the posterior approaches the data-only Normal(ȳ, σ²/N), which has the same centre and spread as the frequentist sampling distribution of ȳ, so a 95% credible interval matches the 95% confidence interval numerically. In this model, frequentist and Bayesian inference numerically COINCIDE under a vague conjugate prior, even though their semantic frames differ.
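The two update formulas can be recovered by completing the square in μ. The steps below are a sketch of that derivation, using only the symbols already defined in this section.

```latex
% Log-posterior in \mu, dropping terms that do not depend on \mu
\log p(\mu \mid D)
  = -\frac{(\mu - \mu_0)^2}{2\tau_0^2}
    - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mu)^2 + \text{const}

% Collect the quadratic and linear terms in \mu, using \sum_i y_i = N \bar{y}
  = -\frac{1}{2} \left( \frac{1}{\tau_0^2} + \frac{N}{\sigma^2} \right) \mu^2
    + \left( \frac{\mu_0}{\tau_0^2} + \frac{N \bar{y}}{\sigma^2} \right) \mu + \text{const}

% Match against the Normal kernel
%   -(\mu - \mu_n)^2 / (2\tau_n^2) = -\mu^2 / (2\tau_n^2) + \mu \mu_n / \tau_n^2 + const
\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{N}{\sigma^2},
\qquad
\frac{\mu_n}{\tau_n^2} = \frac{\mu_0}{\tau_0^2} + \frac{N \bar{y}}{\sigma^2}.
```

The coefficient of μ² gives the posterior precision, and the coefficient of μ gives the precision-weighted mean.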

Normal–Inverse-Gamma: joint mean and variance

The realistic case has BOTH μ AND σ² unknown. The conjugate prior is the Normal–Inverse-Gamma (NIG): factor the prior as $p(\mu, \sigma^2) = p(\mu \mid \sigma^2) p(\sigma^2)$ with

\mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 / \kappa_0), \quad \sigma^2 \sim \text{Inverse-Gamma}(\alpha_0, \beta_0).

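Hyperparameters: μ₀ (prior mean), κ₀ (prior "mean precision pseudo-count", the equivalent prior sample size for the mean), α₀ and β₀ (shape and scale of the Inverse-Gamma on σ², equivalently shape and rate of a Gamma on the precision 1/σ²). Observe N iid Normal samples with sample mean ȳ and sample variance $s^2 = \frac{1}{N} \sum_i (y_i - \bar{y})^2$ (the divide-by-N version, so that $N s^2$ is the sum of squared deviations). The posterior is again NIG with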

\kappa_n = \kappa_0 + N, \quad \mu_n = (\kappa_0 \mu_0 + N \bar{y}) / \kappa_n,

\alpha_n = \alpha_0 + N/2, \quad \beta_n = \beta_0 + (N s^2)/2 + \kappa_0 N (\bar{y} - \mu_0)^2 / (2 \kappa_n).

The MARGINAL posterior of μ (integrating out σ²) is a Student-t distribution with $2\alpha_n$ degrees of freedom, location $\mu_n$, and scale $\sqrt{\beta_n / (\alpha_n \kappa_n)}$. Under a vague NIG prior this is the Bayesian analogue of the frequentist t-based confidence interval: same shape, same root-N rate, different semantics. The Student-t naturally accommodates uncertainty in σ²; as N grows and σ² becomes precisely estimated, the t approaches a Normal.
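The four hyperparameter updates and the Student-t marginal can be sketched in a few lines. The `NIG` container and the function names are illustrative assumptions, and the sum of squared deviations `ss` plays the role of N s² with s² taken as the divide-by-N variance.

```python
from math import sqrt
from typing import NamedTuple, Sequence, Tuple


class NIG(NamedTuple):
    mu: float      # location of mu given sigma^2
    kappa: float   # equivalent prior sample size for the mean
    alpha: float   # Inverse-Gamma shape for sigma^2
    beta: float    # Inverse-Gamma scale for sigma^2


def nig_update(prior: NIG, y: Sequence[float]) -> NIG:
    """Posterior NIG hyperparameters after observing iid Normal samples y."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((v - ybar) ** 2 for v in y)  # N * s^2 (sum of squared deviations)
    kappa_n = prior.kappa + n
    mu_n = (prior.kappa * prior.mu + n * ybar) / kappa_n
    alpha_n = prior.alpha + n / 2
    beta_n = (prior.beta + ss / 2
              + prior.kappa * n * (ybar - prior.mu) ** 2 / (2 * kappa_n))
    return NIG(mu_n, kappa_n, alpha_n, beta_n)


def marginal_t(post: NIG) -> Tuple[float, float, float]:
    """(degrees of freedom, location, scale) of the Student-t marginal of mu."""
    return 2 * post.alpha, post.mu, sqrt(post.beta / (post.alpha * post.kappa))


if __name__ == "__main__":
    # Hypothetical prior and data, for illustration only.
    post = nig_update(NIG(mu=0.0, kappa=1.0, alpha=2.0, beta=2.0), [1.0, 2.0, 3.0])
    print(post, marginal_t(post))
```

Keeping the hyperparameters in one immutable tuple makes sequential updating natural: the posterior from one batch can be passed straight back in as the prior for the next.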

Dirichlet–Multinomial: when you have K categories

This is the K-dimensional analogue of Beta–Binomial. Observe N draws from a Multinomial with category probabilities $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ summing to 1, with category k observed $n_k$ times. A Dirichlet(α₁, …, α_K) prior gives a Dirichlet(α₁ + n₁, …, α_K + n_K) posterior: each prior pseudo-count α_k absorbs the corresponding observed count n_k. The pair is used pervasively in topic modeling (latent Dirichlet allocation), categorical-data analysis, and natural-language Bayesian models.
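In code the update is an element-wise addition. The sketch below uses illustrative function names and hypothetical counts; the posterior mean it computes is the standard Dirichlet mean, each parameter divided by the sum of all parameters.

```python
from typing import List, Sequence


def dirichlet_multinomial_update(alpha: Sequence[float],
                                 counts: Sequence[int]) -> List[float]:
    """Posterior Dirichlet parameters: prior pseudo-counts plus observed counts."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length K")
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha: Sequence[float]) -> List[float]:
    """Mean of a Dirichlet distribution: each parameter over the total."""
    total = sum(alpha)
    return [a / total for a in alpha]


if __name__ == "__main__":
    # Hypothetical example: uniform Dirichlet(1, 1, 1) prior, counts (5, 3, 2).
    posterior = dirichlet_multinomial_update([1.0, 1.0, 1.0], [5, 3, 2])
    print(posterior, dirichlet_mean(posterior))
```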

When conjugacy fails

Many real-world models do NOT have a tractable conjugate form. Examples include logistic regression with a Normal prior, regression with Cauchy errors, mixture models, hierarchical models with multiple latent layers, and non-linear regression. When the closed-form posterior is unavailable, you simulate from the posterior via MCMC: Metropolis (§7.3), Gibbs (§7.4), Hamiltonian Monte Carlo (§7.5). MCMC is the workhorse of modern applied Bayes precisely because conjugacy does not cover modern modelling needs.

Try it

Open Gamma–Poisson. Set α = 2, β = 1 (prior mean 2). Set N = 5, S = 12. The MLE is 12/5 = 2.4. Posterior is Gamma(14, 6) with mean 14/6 ≈ 2.33. Notice the posterior mean is between the prior (2.0) and the MLE (2.4), pulled slightly toward the prior.
Stay on Gamma–Poisson. Crank N to 100 with S = 240 (preserving MLE = 2.4). Posterior becomes Gamma(242, 101) with mean 242/101 ≈ 2.40, essentially the MLE. The prior is overwhelmed.
Switch to Normal–Normal. Set μ₀ = 0, τ₀ = 1 (skeptical prior), σ = 2, N = 4, ȳ = 1.5. The posterior mean is the precision-weighted average. Data precision is N/σ² = 4/4 = 1, prior precision is 1/τ₀² = 1. Posterior mean = (0×1 + 1.5×1)/2 = 0.75. Compare to MLE = 1.5: prior and data have equal weight here.
Stay on Normal–Normal. Increase N to 100. Data precision becomes 100/4 = 25; prior precision still 1. Posterior mean ≈ (0×1 + 1.5×25)/26 ≈ 1.44. Far closer to the MLE — data dominate.
Switch to Normal–Inverse-Gamma. With Normal–Normal you assumed σ KNOWN; here σ² is INFERRED jointly with μ. Notice the posterior on precision τ = 1/σ² shifts after observing data: high s² pulls the posterior toward lower precision (higher σ²). The marginal posterior on μ becomes a Student-t with 2α_n degrees of freedom, appropriately wider than the Normal–Normal posterior to reflect the additional uncertainty about σ².
The unifying observation is that for every conjugate family the update rule is structurally the same: add the prior pseudo-counts (or precision-equivalents) to the data counts (or precision-equivalents). This is the Diaconis–Ylvisaker form.

A structural engineer monitors a specific bridge cable for stress. Yearly maximum stress follows Normal(μ, σ²) with σ = 0.4 MPa KNOWN from the manufacturer's spec. The prior is μ₀ = 3.0 MPa with prior SD τ₀ = 0.2 MPa (substantive judgment from design loads). Five years of data give ȳ = 3.6 MPa. What are the posterior mean and 95% credible interval for μ? Compare to the data-only estimate μ ≈ 3.6 with SE ≈ 0.18.
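One way to check your answer to the cable exercise is to code the Normal–Normal update directly. This is a minimal sketch with illustrative function names; it uses `statistics.NormalDist` from the Python standard library for the interval and assumes an equal-tailed credible interval.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def normal_normal_update(mu0: float, tau0: float, sigma: float,
                         n: int, ybar: float) -> Tuple[float, float]:
    """Posterior (mean, sd) of mu under a Normal(mu0, tau0^2) prior, known sigma."""
    prior_precision = 1.0 / tau0 ** 2
    data_precision = n / sigma ** 2
    post_precision = prior_precision + data_precision   # precisions add
    post_mean = (mu0 * prior_precision + ybar * data_precision) / post_precision
    return post_mean, sqrt(1.0 / post_precision)


def credible_interval(mean: float, sd: float,
                      level: float = 0.95) -> Tuple[float, float]:
    """Equal-tailed credible interval of a Normal(mean, sd^2) posterior."""
    dist = NormalDist(mean, sd)
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


if __name__ == "__main__":
    # Inputs taken from the exercise above.
    mean, sd = normal_normal_update(mu0=3.0, tau0=0.2, sigma=0.4, n=5, ybar=3.6)
    low, high = credible_interval(mean, sd)
    print(f"posterior mean {mean:.3f}, sd {sd:.3f}, 95% interval ({low:.3f}, {high:.3f})")
```

The same function reproduces the Normal–Normal steps in Try it: with μ₀ = 0, τ₀ = 1, σ = 2, N = 4, ȳ = 1.5 it returns a posterior mean of 0.75.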

What you now know

Conjugate priors give closed-form posterior updates for every exponential-family likelihood. The three workhorses are Gamma–Poisson for rates, Normal–Normal for means with known variance, and Normal–Inverse-Gamma for joint mean and variance; Dirichlet–Multinomial covers K-category data. Diaconis–Ylvisaker (1979) characterised which models admit conjugate priors and gave a constructive recipe. Closed-form posteriors are the gold standard when available; §§7.3–7.5 develop MCMC for non-conjugate models, which make up the majority of modern applied Bayes.

References

Diaconis, P., Ylvisaker, D. (1979). "Conjugate priors for exponential families." Annals of Statistics 7(2), 269–281. (The characterisation theorem.)
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2013). Bayesian Data Analysis (3rd ed.), Chapter 3. CRC.
Bernardo, J.M., Smith, A.F.M. (1994). Bayesian Theory. Wiley. (Comprehensive treatment of conjugacy.)
Robert, C.P. (2007). The Bayesian Choice (2nd ed.). Springer. (Decision-theoretic Bayesian foundations.)
Murphy, K. (2007). "Conjugate Bayesian analysis of the Gaussian distribution." Technical report, UBC. (Excellent self-contained derivation of Normal–Inverse-Gamma.)