Prior, likelihood, posterior — the mechanics

Part 7 — Bayesian methods

Learning objectives

  • State Bayes' theorem and identify its three pillars: prior, likelihood, posterior
  • Compute the Beta–Binomial posterior as Beta(α + k, β + n − k)
  • Interpret the posterior mean as a precision-weighted compromise between prior and data
  • Construct and interpret a 95% CREDIBLE INTERVAL (and distinguish it from a confidence interval)
  • Recognise the special role of the UNIFORM prior Beta(1, 1) and JEFFREYS prior Beta(½, ½)

Frequentist statistics treats the parameter θ as a FIXED but unknown constant and asks: "what data would I see if I repeated the experiment many times?" Bayesian statistics treats θ as a RANDOM VARIABLE with a probability distribution that encodes uncertainty, and asks: "given the data I actually saw, what should I now believe about θ?" The Bayesian recipe is a single equation — Bayes' theorem — and a discipline for using it well.

Bayes' theorem

For a parameter θ and observed data D:

$$p(\theta \mid D) = \frac{p(D \mid \theta) \, p(\theta)}{p(D)}.$$

Three pillars:

  • Prior $p(\theta)$: what you believed about θ BEFORE seeing the data. Encodes subject-matter knowledge, regularisation, or deliberate scepticism.
  • Likelihood $p(D \mid \theta)$: how likely the observed data are at each candidate θ. Same object that drives MLE in §1.2; in Bayes it is multiplied with the prior rather than maximised.
  • Posterior $p(\theta \mid D)$: the rational compromise. What you believe about θ AFTER incorporating the data. ALL inference in Bayesian statistics is about reading off summaries (mean, median, credible interval) of the posterior.

The denominator $p(D) = \int p(D \mid \theta) \, p(\theta) \, d\theta$ is the MARGINAL LIKELIHOOD or "evidence". It is a normalising constant: it does not depend on θ. For most inferential questions about θ you can ignore it, since it merely rescales the posterior. It returns in §7.7 when comparing competing models.
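To make the roles of the three pillars and the evidence concrete, here is a minimal grid-approximation sketch. The function name `grid_posterior`, the grid size, and the counts k = 14, n = 20 are illustrative assumptions, not part of the course material. It multiplies likelihood by prior on a grid of θ values and then divides by a numerical estimate of p(D).

```python
from math import comb

def grid_posterior(k, n, prior, grid_size=1001):
    """Approximate p(theta | k, n) on an evenly spaced grid over [0, 1]."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalised posterior: binomial likelihood times prior at each grid point.
    unnorm = [comb(n, k) * t**k * (1 - t)**(n - k) * prior(t) for t in thetas]
    # Riemann-sum estimate of the evidence p(D), the normalising constant.
    width = 1 / (grid_size - 1)
    evidence = sum(unnorm) * width
    density = [u / evidence for u in unnorm]
    return thetas, density, evidence

# Flat prior (density 1 everywhere on [0, 1]) with illustrative counts.
thetas, density, evidence = grid_posterior(k=14, n=20, prior=lambda t: 1.0)
mode = thetas[density.index(max(density))]
print(f"evidence ~ {evidence:.4f}, posterior mode ~ {mode:.2f}")
```

Dividing by the evidence changes the height of the curve but not its shape, which is why the location of the mode can be read off the unnormalised product.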

The canonical example: Beta–Binomial

Let θ ∈ [0, 1] be the probability of a "success". Observe n binary trials with k successes. The likelihood is

$$p(k \mid \theta, n) = \binom{n}{k} \theta^k (1 - \theta)^{n - k}.$$

Choose a Beta(α, β) prior for θ. Then the posterior is

$$p(\theta \mid k, n) \propto \theta^{k + \alpha - 1} (1 - \theta)^{n - k + \beta - 1} = \text{Beta}(\alpha + k, \, \beta + n - k).$$

The prior is updated by ADDING the observed counts: α gains k, β gains n − k. The posterior is again a Beta distribution — we call this the CONJUGATE case (full treatment in §7.2). Even better, this update has an intuitive interpretation: α and β are PSEUDO-COUNTS — "as if I had previously seen α successes and β failures". The data adds k more successes and n − k more failures. The posterior reflects the combined evidence.
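The count-addition rule can be checked numerically. The sketch below is an illustration under assumed values (a Beta(2, 3) prior and k = 14, n = 20); the helper names `beta_pdf` and `beta_binomial_update` are hypothetical. It confirms that likelihood times prior is a constant multiple of the Beta(α + k, β + n − k) density at several values of θ.

```python
from math import comb, exp, lgamma, log

def beta_pdf(theta, a, b):
    """Beta(a, b) density at theta in (0, 1), computed on the log scale."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(theta) + (b - 1) * log(1 - theta))

def beta_binomial_update(alpha, beta, k, n):
    """Conjugate update: add successes to alpha and failures to beta."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    return alpha + k, beta + (n - k)

alpha, beta, k, n = 2.0, 3.0, 14, 20  # illustrative values
a_post, b_post = beta_binomial_update(alpha, beta, k, n)

# If the posterior really is Beta(a_post, b_post), this ratio is the same
# at every theta (it equals the evidence p(D)).
ratios = []
for theta in (0.2, 0.5, 0.8):
    unnorm = comb(n, k) * theta**k * (1 - theta)**(n - k) * beta_pdf(theta, alpha, beta)
    ratios.append(unnorm / beta_pdf(theta, a_post, b_post))

print(a_post, b_post)
print(ratios)
```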

The posterior mean as precision-weighted compromise

For Beta(α + k, β + n − k) the posterior mean is

$$E[\theta \mid D] = \frac{\alpha + k}{\alpha + \beta + n} = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n} \cdot \frac{k}{n}.$$

The right-hand side decomposes the posterior mean into a CONVEX COMBINATION of the prior mean $\alpha/(\alpha + \beta)$ and the maximum-likelihood estimate $k/n$. The weights are the relative PRECISIONS (effective sample sizes) of prior and data. When n is small relative to (α + β), the posterior is anchored to the prior; when n is large, the posterior closely tracks the MLE. This precision-weighting is the soul of the Bayesian update.
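As a quick arithmetic check of the decomposition, the sketch below computes both weights and both estimates and confirms that the weighted sum matches the direct formula (α + k)/(α + β + n). The function name and the Beta(2, 18) prior with k = 14, n = 20 are assumptions chosen for illustration; the function requires n > 0 so that k/n is defined.

```python
def posterior_mean_decomposition(alpha, beta, k, n):
    """Return (prior weight, prior mean, data weight, MLE) for a Beta-Binomial model."""
    total = alpha + beta + n
    prior_weight = (alpha + beta) / total  # share of the prior pseudo-counts
    data_weight = n / total                # share of the observed trials
    prior_mean = alpha / (alpha + beta)
    mle = k / n
    return prior_weight, prior_mean, data_weight, mle

w_prior, prior_mean, w_data, mle = posterior_mean_decomposition(2, 18, 14, 20)
combined = w_prior * prior_mean + w_data * mle
direct = (2 + 14) / (2 + 18 + 20)
print(w_prior, w_data)   # the two weights sum to one
print(combined, direct)  # both routes give the same posterior mean
```

Here the prior pseudo-count total equals the number of trials, so the two weights are equal and the posterior mean sits midway between the prior mean and the MLE.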

Credible intervals vs confidence intervals

A 95% CREDIBLE INTERVAL is an interval (L, U) such that

$$\Pr(\theta \in (L, U) \mid D) = 0.95.$$

Read literally: "given the data, there is a 95% probability that θ lies in (L, U)". This is the interpretation people often want from a confidence interval, but a confidence interval (§3) does NOT support that statement: it is a procedural guarantee over repeated sampling, not a probability statement about θ. The credible interval delivers what the frequentist CI is so often misread as delivering.

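Two conventions: EQUAL-TAILED (cut off 2.5% from each tail) and HIGHEST POSTERIOR DENSITY (the shortest interval containing 95% probability mass). For symmetric unimodal posteriors they coincide. For skewed posteriors they differ, and HPD is usually preferred because it is the shorter of the two and every point inside it has higher posterior density than any point outside.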
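Both conventions can be approximated from posterior draws. The following is a sketch, not a definitive implementation: it uses the standard library's `random.betavariate`, a hypothetical helper name `credible_intervals`, and a skewed Beta(2, 18) posterior chosen purely for illustration. The HPD search (narrowest window holding 95% of the sorted draws) assumes a unimodal posterior.

```python
import random

def credible_intervals(a, b, level=0.95, draws=100_000, seed=0):
    """Estimate equal-tailed and HPD intervals for Beta(a, b) from posterior draws."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))

    # Equal-tailed: cut (1 - level) / 2 of the draws from each end.
    tail = (1 - level) / 2
    lo = int(tail * draws)
    hi = int((1 - tail) * draws) - 1
    equal_tailed = (samples[lo], samples[hi])

    # HPD (unimodal case): the narrowest window holding `level` of the draws.
    window = int(level * draws)
    start = min(range(draws - window),
                key=lambda i: samples[i + window] - samples[i])
    hpd = (samples[start], samples[start + window])
    return equal_tailed, hpd

et, hpd = credible_intervals(2, 18)
print("equal-tailed:", et, "width", et[1] - et[0])
print("HPD:         ", hpd, "width", hpd[1] - hpd[0])
```

For this right-skewed posterior the HPD interval should come out shorter and shifted towards zero relative to the equal-tailed one; for a symmetric posterior the two would nearly coincide.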

Choosing a prior

Three commonly-used Beta priors:

  • Uniform (α = β = 1): Beta(1, 1) is the flat prior on [0, 1]. "No prior information." Posterior is Beta(1 + k, 1 + n − k); posterior mean is (k + 1)/(n + 2) — Laplace's rule of succession (1814).
  • Jeffreys (α = β = ½): Beta(½, ½) is the Jeffreys prior for a binomial proportion. Theoretical appeal: it is invariant under reparameterisation. It is not flat but U-shaped, with density rising towards θ = 0 and θ = 1.
  • Informative: α = 10, β = 90 encodes "I think θ is around 10% with moderate confidence." Strong subject-matter priors require strong subject-matter justification; weak priors leave the data in charge.

The choice of prior is a DESIGN decision: it should be honest and pre-specified, just like the significance level of a hypothesis test. SENSITIVITY ANALYSIS for priors (re-fit with several reasonable priors) is best practice for any consequential inference.
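A sensitivity analysis for the Beta–Binomial model is only a few lines. The sketch below re-computes the posterior mean under the three priors listed above; the data (k = 14 successes in n = 20 trials) and the function name `sensitivity` are assumptions for illustration.

```python
# The three Beta priors discussed above, as (alpha, beta) pairs.
priors = {
    "uniform": (1.0, 1.0),
    "jeffreys": (0.5, 0.5),
    "informative": (10.0, 90.0),
}

def sensitivity(k, n, priors):
    """Posterior mean (alpha + k) / (alpha + beta + n) under each prior."""
    return {name: (a + k) / (a + b + n) for name, (a, b) in priors.items()}

means = sensitivity(k=14, n=20, priors=priors)
for name, mean in means.items():
    print(f"{name:12s} posterior mean = {mean:.3f}")
```

With only 20 trials, the two weak priors give nearly the same answer, while the informative prior (100 pseudo-counts) dominates the data. A spread like this is exactly what a sensitivity analysis is meant to expose.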

Sequential updating: yesterday's posterior is today's prior

If you observe data $D_1$ and later $D_2$, the posterior after both equals what you would get by treating the post-$D_1$ posterior as the prior for $D_2$. Mathematically, provided $D_2$ is conditionally independent of $D_1$ given θ: $p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta) \, p(\theta \mid D_1)$. Bayesian inference is intrinsically sequential, and under that independence assumption the final posterior depends on neither the order in which the batches arrive nor their timing. This is the foundation of online learning, A/B testing with sequential analysis, and any streaming-data setting.
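For the Beta–Binomial model, sequential updating reduces to adding counts in stages. The sketch below splits an illustrative data set (14 successes in 20 trials, an assumed example) into two batches and checks that one batch update, two sequential updates, and the same two updates in reverse order all produce the same posterior. The helper name `update` is hypothetical, and the batches are assumed conditionally independent given θ.

```python
def update(prior, k, n):
    """Conjugate Beta-Binomial update: prior is an (alpha, beta) pair."""
    a, b = prior
    return (a + k, b + (n - k))

prior = (1, 1)  # uniform prior

# All 20 trials at once.
batch = update(prior, k=14, n=20)

# The same 20 trials split into two batches: 5/8 first, then 9/12.
after_first = update(prior, k=5, n=8)
sequential = update(after_first, k=9, n=12)

# The same two batches in the opposite order.
reversed_order = update(update(prior, k=9, n=12), k=5, n=8)

print(batch, sequential, reversed_order)
```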

Interactive figure: Posterior Builder (requires JavaScript).

Try it

  • Click Uniform prior preset. Beta(1, 1) is FLAT — the prior is shown as a horizontal line. With k = 14 successes in n = 20 trials, the posterior is Beta(15, 7), peaking at θ ≈ 14/20 = 0.7. Notice: when the prior is uniform, the posterior is JUST a re-normalised likelihood. Frequentist MLE and Bayesian MAP coincide here.
  • Click Skeptic (low θ). Now Beta(2, 18) encodes "I expect θ ≈ 0.1". With the same data (k = 14, n = 20), the posterior is pulled LEFT relative to the likelihood peak: the prior's scepticism survives. The posterior mean is (2 + 14) / (20 + 20) = 0.40, exactly halfway between the prior mean of 0.10 and the MLE of 0.70, because the prior's pseudo-count total α + β = 20 equals n = 20.
  • Click Believer (high θ). Beta(18, 2) encodes "I expect θ ≈ 0.9". Now the posterior is pulled RIGHT of the data's likelihood peak. The two strong priors give opposite-direction shifts — making explicit the role of subjective prior choice.
  • Click Tons of data. With k = 140 successes in n = 200 trials, the prior loses its grip: even a strongly sceptical prior Beta(2, 18) would give posterior Beta(142, 78) with mean ≈ 0.645, close to the MLE of 0.70. Posterior ≈ likelihood once n >> (α + β).
  • Manually drag α and β to (5, 5) and n to 0. Posterior = prior (no data to update). Then drag n up while keeping k/n constant at 0.5: watch the credible-interval width SHRINK roughly like 1/√n — the same root-n rate as classical SEs. Bayesian and frequentist convergence rates are equivalent under regularity; only the interpretation differs.
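The root-n shrinkage in the last item can be checked without the widget. The sketch below uses the posterior standard deviation as a stand-in for credible-interval width (a reasonable proxy once the posterior is close to normal, which is an assumption here), with the Beta(5, 5) prior and k/n held at 0.5. The helper name `beta_sd` and the chosen values of n are illustrative.

```python
from math import sqrt

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

rows = []
for n in (20, 80, 320, 1280):  # each step quadruples n
    k = n / 2                  # keep k/n fixed at 0.5
    sd = beta_sd(5 + k, 5 + n - k)
    rows.append((n, sd, sd * sqrt(n)))
    print(f"n = {n:5d}  posterior sd = {sd:.4f}  sd * sqrt(n) = {sd * sqrt(n):.3f}")
```

Quadrupling n roughly halves the posterior standard deviation once n is large compared with α + β = 10, and sd · √n levels off near 0.5, the same constant as the classical standard error √(0.5 · 0.5 / n).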

An A/B test reports 14 conversions out of 20 visits for the new design (variant B) and 8 conversions out of 20 for the old (variant A). Using uniform priors for each variant's conversion rate, what is the posterior probability that B's conversion rate exceeds A's?
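One way to approach this question is Monte Carlo: with uniform priors the posteriors are Beta(1 + 8, 1 + 12) for A and Beta(1 + 14, 1 + 6) for B, and the probability that B exceeds A can be estimated by drawing from both and counting. The sketch below assumes the two variants are independent (separate priors, separate visitors); the function name, draw count, and seed are illustrative choices.

```python
import random

def prob_b_beats_a(k_a, n_a, k_b, n_b, draws=200_000, seed=0):
    """Monte Carlo estimate of Pr(theta_B > theta_A | data) under uniform priors."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    wins = 0
    for _ in range(draws):
        # Independent draws from each variant's Beta posterior.
        theta_a = rng.betavariate(1 + k_a, 1 + n_a - k_a)
        theta_b = rng.betavariate(1 + k_b, 1 + n_b - k_b)
        wins += theta_b > theta_a
    return wins / draws

p = prob_b_beats_a(k_a=8, n_a=20, k_b=14, n_b=20)
print(f"Pr(theta_B > theta_A | data) ~ {p:.3f}")
```

Because B's posterior sits mostly to the right of A's, the estimate should land well above one half. Unlike a p-value, the result is a direct probability statement about the two conversion rates.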

What you now know

Bayes' theorem turns a parametric model into an updating machine: the data multiplies the prior (yielding a posterior up to a normalising constant) and you read off whatever summary the decision requires. For the canonical Beta–Binomial, the update is just count addition. The posterior mean is a precision-weighted compromise. Credible intervals deliver direct probability statements about θ. Prior choice is a substantive modelling decision deserving the same care as any other. §7.2 develops the broader CONJUGATE family (Normal–Normal, Gamma–Poisson, Dirichlet–Multinomial). §7.3–7.5 develop COMPUTATIONAL machinery (Metropolis, Gibbs, HMC) for when conjugacy fails.

References

  • Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B. (2013). Bayesian Data Analysis (3rd ed.). CRC. (The standard Bayesian textbook.)
  • Jaynes, E.T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. (The philosophical foundation of objective Bayes.)
  • McElreath, R. (2020). Statistical Rethinking (2nd ed.). CRC. (Accessible undergraduate-level introduction with R/Stan code.)
  • Laplace, P.S. (1814). Essai philosophique sur les probabilités. (Original rule of succession.)
  • Jeffreys, H. (1946). "An invariant form for the prior probability in estimation problems." Proc. Royal Society A 186(1007), 453–461. (Jeffreys prior.)
