Gibbs sampling on a 2D example

Part 7 — Bayesian methods

Learning objectives

Define a GIBBS SAMPLER as a Markov chain that updates one coordinate (or one block of coordinates) at a time from its FULL CONDITIONAL distribution
Construct a Gibbs sampler for the bivariate Normal using its closed-form conditionals
Recognise that Gibbs is a special case of Metropolis-Hastings where the acceptance ratio is identically 1
Diagnose SLOW MIXING when the parameters are highly correlated (anisotropic posteriors)
Apply BLOCKED Gibbs and parameter-augmentation techniques to overcome correlation-induced slow mixing

When the joint posterior $p(\theta_1, \ldots, \theta_K \mid D)$ is intractable but the FULL CONDITIONALS $p(\theta_k \mid \theta_{-k}, D)$ are each tractable, GIBBS SAMPLING is usually a better fit than random-walk Metropolis: there is no proposal scale to tune. Gibbs sweeps through the coordinates, sampling each one from its conditional given the current values of the others. Because each step samples EXACTLY from the conditional, there is no accept/reject step; every move is accepted. The chain is constructed automatically; the work is in deriving the conditionals.

The algorithm

Given current state $(\theta_1^{(t)}, \theta_2^{(t)}, \ldots, \theta_K^{(t)})$ :

Sample $\theta_1^{(t+1)} \sim p(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_K^{(t)}, D)$
Sample $\theta_2^{(t+1)} \sim p(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_K^{(t)}, D)$
...
Sample $\theta_K^{(t+1)} \sim p(\theta_K \mid \theta_1^{(t+1)}, \ldots, \theta_{K-1}^{(t+1)}, D)$

After $K$ updates the state vector has been fully refreshed; that constitutes one ITERATION. Coordinates can be updated in any order (random scan, deterministic sweep, blocked); as long as every coordinate keeps being updated, the stationary distribution remains the joint posterior π.
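The sweep above can be sketched generically. This is a minimal illustration, not a reference implementation: the names `gibbs_sweep`, `run_gibbs`, and the convention that each conditional sampler is a function `(state, rng) -> value` are assumptions made here, and the toy target (two independent standard Normals) is chosen only so the example runs on its own.

```python
import random

def gibbs_sweep(state, conditional_samplers, rng):
    """One deterministic-scan Gibbs iteration.

    conditional_samplers[k](state, rng) is assumed to return a draw of
    coordinate k from its full conditional given the other entries of state.
    """
    state = list(state)  # copy so the caller's list is not mutated
    for k, sample_k in enumerate(conditional_samplers):
        # Coordinates with index < k already hold their (t+1) values here.
        state[k] = sample_k(state, rng)
    return state

def run_gibbs(initial_state, conditional_samplers, n_iter, seed=0):
    """Run n_iter sweeps and return the list of visited states."""
    rng = random.Random(seed)
    state = list(initial_state)
    chain = []
    for _ in range(n_iter):
        state = gibbs_sweep(state, conditional_samplers, rng)
        chain.append(state)
    return chain

# Toy target (illustrative only): two independent standard Normals, so each
# full conditional ignores the other coordinate.
samplers = [
    lambda state, rng: rng.gauss(0.0, 1.0),
    lambda state, rng: rng.gauss(0.0, 1.0),
]
chain = run_gibbs([5.0, -5.0], samplers, n_iter=1000)
```

For a real model, only the entries of `samplers` change; the sweep logic stays the same.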

Gibbs uses the CONDITIONALS to define a valid Markov transition. Viewed as a Metropolis-Hastings sampler with proposal $q = p(\theta_k \mid \theta_{-k}, D)$, each coordinate update has a Hastings ratio identically equal to 1, so the proposal is always accepted. Each single-coordinate update satisfies detailed balance with the joint π because the conditional is derived from π itself; a full sweep composes such updates and therefore leaves π invariant, even though a deterministic sweep as a whole need not be reversible.
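To see why the ratio is 1, a short derivation helps. The notation is an assumption made here for illustration: $\theta = (\theta_k, \theta_{-k})$ is the current state, $\theta' = (\theta_k', \theta_{-k})$ is the proposed state (only coordinate $k$ changes), and the data $D$ is suppressed. With proposal $q(\theta' \mid \theta) = \pi(\theta_k' \mid \theta_{-k})$ and the factorisation $\pi(\theta) = \pi(\theta_k \mid \theta_{-k})\, \pi(\theta_{-k})$, the Metropolis-Hastings ratio is

$$\frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)} = \frac{\pi(\theta_k' \mid \theta_{-k})\, \pi(\theta_{-k}) \cdot \pi(\theta_k \mid \theta_{-k})}{\pi(\theta_k \mid \theta_{-k})\, \pi(\theta_{-k}) \cdot \pi(\theta_k' \mid \theta_{-k})} = 1.$$

Every factor cancels, so the acceptance probability is $\min(1, 1) = 1$.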

The bivariate Normal example

Let $(X, Y) \sim \mathcal{N}_2(\mathbf{0}, \Sigma)$ with

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

The conditional distributions are:

$$X \mid Y \sim \mathcal{N}(\rho Y, \, 1 - \rho^2), \quad Y \mid X \sim \mathcal{N}(\rho X, \, 1 - \rho^2).$$

One Gibbs iteration samples X from the first, then Y from the second. The L-SHAPED step pattern in the widget is characteristic: one axis-aligned move along x, then one along y.
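A minimal sketch of this sampler, using only the two conditionals stated above. The function name `gibbs_bivariate_normal`, the starting point, and the seed are illustrative choices, not part of the lesson's widget.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, x0=0.0, y0=0.0, seed=0):
    """Gibbs sampler for a standard bivariate Normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # conditional SD, same for both axes
    x, y = x0, y0
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, cond_sd)  # X | Y = y  (horizontal move)
        y = rng.gauss(rho * x, cond_sd)  # Y | X = new x  (vertical move)
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.7, n_iter=20000)

# Empirical moments, to compare against the target (variance 1, correlation rho).
n = len(samples)
mean_x = sum(s[0] for s in samples) / n
mean_y = sum(s[1] for s in samples) / n
var_x = sum((s[0] - mean_x) ** 2 for s in samples) / n
var_y = sum((s[1] - mean_y) ** 2 for s in samples) / n
cov_xy = sum((s[0] - mean_x) * (s[1] - mean_y) for s in samples) / n
corr_xy = cov_xy / math.sqrt(var_x * var_y)
```

Each loop pass is one L-shaped step: the x-update followed by the y-update that conditions on the new x.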

Why Gibbs can be slow: correlation

The CONDITIONAL variance of X given Y is $(1 - \rho^2)$. As $|\rho| \to 1$, this variance shrinks to zero. Each Gibbs step then moves only a small distance along its axis; the chain zig-zags along the long diagonal of the target ellipse, making only tiny advances per iteration. The chain's AUTOCORRELATION grows, the effective sample size shrinks, and mixing slows dramatically. This is the central pathology of vanilla Gibbs: HIGHLY CORRELATED parameters take many iterations to explore.
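The slowdown can be quantified for this example. A sketch, assuming the X-then-Y sweep order used above and writing $\varepsilon_t, \varepsilon_t'$ for independent standard Normal draws (notation introduced here): since $Y^{(t)} = \rho X^{(t)} + \sqrt{1-\rho^2}\,\varepsilon_t$ and $X^{(t+1)} = \rho Y^{(t)} + \sqrt{1-\rho^2}\,\varepsilon_t'$, substituting gives

$$X^{(t+1)} = \rho^2 X^{(t)} + \sqrt{1-\rho^2}\,(\rho\,\varepsilon_t + \varepsilon_t'),$$

an AR(1) process with coefficient $\rho^2$. Its lag-$k$ autocorrelation is $\rho^{2k}$, so the integrated autocorrelation time for estimating the mean of $X$ is

$$1 + 2\sum_{k \ge 1} \rho^{2k} = \frac{1+\rho^2}{1-\rho^2}.$$

At $\rho = 0.7$ this is about 2.9 iterations per effectively independent draw; at $\rho = 0.95$ it is about 19.5.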

Three remedies:

Block Gibbs: update CORRELATED PARAMETERS TOGETHER as a block, sampling from their joint conditional. For the bivariate Normal example, just sample $(X, Y)$ jointly — one step, no zig-zag.
Re-parameterise: transform variables so the posterior becomes nearly uncorrelated. For the bivariate Normal, the eigen-basis $(U, V) = ((X + Y)/\sqrt{2}, (X - Y)/\sqrt{2})$ gives INDEPENDENT components $U \sim \mathcal{N}(0, 1 + \rho)$ and $V \sim \mathcal{N}(0, 1 - \rho)$. Gibbs in this rotated basis mixes fast regardless of ρ; here each sweep is an independent draw from the target.
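Parameter augmentation: Tanner-Wong (1987) introduced AUXILIARY (latent) variables that make conditionals easier to sample, and Gelfand-Smith (1990) brought the resulting samplers into mainstream Bayesian statistics. For example, Gibbs samplers for the LDA topic model use a latent topic variable for each word, which reduces each update to a draw from a standard distribution.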
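The first two remedies can be sketched for the bivariate Normal example. The function names `blocked_draw` and `rotated_gibbs` are illustrative, and the blocked draw uses a hand-written Cholesky factor of Σ, which is one possible way to sample the pair jointly rather than the only one.

```python
import math
import random

def blocked_draw(rho, rng):
    """Sample (X, Y) jointly via the Cholesky factor of Sigma: one step, no zig-zag."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = z1
    y = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
    return x, y

def rotated_gibbs(rho, n_iter, seed=0):
    """Gibbs in the eigen-basis (U, V), mapped back to (X, Y)."""
    rng = random.Random(seed)
    sd_u = math.sqrt(1.0 + rho)
    sd_v = math.sqrt(1.0 - rho)
    root2 = math.sqrt(2.0)
    samples = []
    for _ in range(n_iter):
        # U and V are independent, so each full conditional is just the marginal.
        u = rng.gauss(0.0, sd_u)
        v = rng.gauss(0.0, sd_v)
        # Invert U = (X + Y)/sqrt(2), V = (X - Y)/sqrt(2).
        samples.append(((u + v) / root2, (u - v) / root2))
    return samples

rng = random.Random(1)
blocked_samples = [blocked_draw(0.95, rng) for _ in range(20000)]
rotated_samples = rotated_gibbs(0.95, n_iter=20000)
```

In both versions successive draws are independent even at ρ = 0.95, which is the point of the remedy. In realistic models the rotation is rarely known exactly, so the gain is usually partial.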

Where Gibbs shines: hierarchical models

The classical application is HIERARCHICAL MODELS where each layer has tractable conditional posteriors. Example: hierarchical linear regression with random effects.

$$y_{ij} = \alpha_j + \beta x_{ij} + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \quad \alpha_j \sim \mathcal{N}(\mu_\alpha, \tau^2).$$

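Conditionals (under the usual conjugate priors: Normal or flat on β and μ_α, Inverse-Gamma on σ² and τ²): each $\alpha_j$ given the other parameters is Normal (closed form); β given α, σ² is Normal (a closed-form least-squares update); σ² is Inverse-Gamma; μ_α is Normal; τ² is Inverse-Gamma. A Gibbs sampler updates all five blocks in sequence each iteration. This is the core strategy behind BUGS, WinBUGS, OpenBUGS, and JAGS (Plummer 2003), the early Bayesian-software workhorses, which performed inference this way for two decades and fell back on general-purpose samplers (for example slice sampling) for conditionals without a closed form.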
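As one concrete instance, the conditional for a single random intercept can be written out. This is a sketch under the model above, with $n_j$ (a symbol introduced here) denoting the number of observations in group $j$; the Normal prior on $\alpha_j$ is the one already stated:

$$\alpha_j \mid \beta, \sigma^2, \mu_\alpha, \tau^2, y \sim \mathcal{N}(m_j, v_j), \quad v_j = \left(\frac{n_j}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1}, \quad m_j = v_j \left(\frac{1}{\sigma^2}\sum_{i=1}^{n_j} (y_{ij} - \beta x_{ij}) + \frac{\mu_\alpha}{\tau^2}\right).$$

The mean $m_j$ is a precision-weighted compromise between the group's residuals and the population mean $\mu_\alpha$, which is the shrinkage behaviour typical of hierarchical models.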

Gibbs as data augmentation

Albert & Chib (1993) showed how to fit a probit regression with Gibbs by introducing latent variables $Z_i \sim \mathcal{N}(\mathbf{x}_i^T \beta, 1)$ with $Y_i = \mathbb{1}(Z_i > 0)$. Conditional on Z, the model is an ordinary linear regression with unit noise variance, so under a Normal (or flat) prior the coefficient vector β has a Normal conditional (closed form). Conditional on β and the observed $Y_i$, each $Z_i$ is a truncated Normal: truncated to $Z_i > 0$ when $Y_i = 1$ and to $Z_i \le 0$ when $Y_i = 0$. Gibbs alternates the two. This DATA AUGMENTATION trick made previously intractable models tractable, and remains a workhorse of modern Bayesian computing.
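A minimal sketch of this alternation, under simplifying assumptions that are not in the original paper's general setup: a single scalar coefficient with no intercept, a flat prior on β, synthetic data generated here with a hypothetical true value of 1.0, and truncated-Normal draws done by plain rejection (simple and exact, but slow when the truncation region is far in the tail).

```python
import random

def probit_gibbs(xs, ys, n_iter, seed=0):
    """Data-augmentation Gibbs for y_i = 1(z_i > 0), z_i ~ N(x_i * beta, 1)."""
    rng = random.Random(seed)
    sxx = sum(x * x for x in xs)
    beta = 0.0
    draws = []
    for _ in range(n_iter):
        # Step 1: Z_i | beta, y_i is N(x_i * beta, 1) truncated to the side
        # of zero that agrees with y_i. Rejection: redraw until the sign matches.
        zs = []
        for x, y in zip(xs, ys):
            mean = x * beta
            while True:
                z = rng.gauss(mean, 1.0)
                if (z > 0) == (y == 1):
                    break
            zs.append(z)
        # Step 2: beta | Z is Normal with the least-squares mean and
        # variance 1 / sum(x^2) (flat prior, unit noise variance).
        sxz = sum(x * z for x, z in zip(xs, zs))
        beta = rng.gauss(sxz / sxx, (1.0 / sxx) ** 0.5)
        draws.append(beta)
    return draws

# Synthetic data for illustration only.
data_rng = random.Random(1)
true_beta = 1.0
xs = [data_rng.gauss(0.0, 1.0) for _ in range(300)]
ys = [1 if data_rng.gauss(x * true_beta, 1.0) > 0 else 0 for x in xs]

draws = probit_gibbs(xs, ys, n_iter=1500)
kept = draws[300:]  # discard an initial burn-in stretch
posterior_mean = sum(kept) / len(kept)
```

With a vector β the second step becomes a multivariate Normal draw, and a production sampler would replace the rejection loop with a dedicated truncated-Normal routine.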

When NOT to use Gibbs

Conditionals are not in closed form — you'd need Metropolis-within-Gibbs (substitute Metropolis steps for the intractable conditionals).
Parameters are highly correlated and re-parameterisation is hard.
The model is high-dimensional with continuous parameters: HMC (§7.5) typically explores the joint posterior much faster than Gibbs by using gradient information.
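The Metropolis-within-Gibbs fallback named in the first item can be sketched on the bivariate Normal. The setup is deliberately artificial: the x-update stays an exact conditional draw, while the y-conditional is treated as if it were unavailable and replaced by a random-walk Metropolis step on the unnormalised joint density. The function names and the proposal scale `step` are assumptions made for illustration.

```python
import math
import random

def log_joint(x, y, rho):
    """Unnormalised log density of the standard bivariate Normal."""
    return -(x * x - 2.0 * rho * x * y + y * y) / (2.0 * (1.0 - rho * rho))

def metropolis_within_gibbs(rho, n_iter, step=1.0, seed=0):
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    accepted = 0
    for _ in range(n_iter):
        # Exact Gibbs update for x from its full conditional.
        x = rng.gauss(rho * y, cond_sd)
        # Metropolis update for y: symmetric random-walk proposal, so the
        # acceptance ratio is just the ratio of target densities.
        y_prop = y + rng.gauss(0.0, step)
        log_ratio = log_joint(x, y_prop, rho) - log_joint(x, y, rho)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            y = y_prop
            accepted += 1
        samples.append((x, y))
    return samples, accepted / n_iter

samples, acceptance_rate = metropolis_within_gibbs(rho=0.7, n_iter=30000)
```

Unlike pure Gibbs, this variant reintroduces a tuning parameter (`step`) and an acceptance rate below 1 for the coordinate handled by Metropolis.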

For most modern applications, Stan / PyMC / NumPyro use HMC or NUTS as the default. Gibbs remains essential for some structured models (mixed-effects, hidden-Markov-model state inference, topic models) where the conditionals are particularly clean.

Try it

Start with ρ = 0.7 (the default). Click "Run 200 iter" several times. The blue dots accumulate along the diagonal target ellipse. The red current-state dot wanders. The recent L-shaped steps (red lines) show each iteration's x-update then y-update. Empirical mean and variance converge toward the truth.
Drag ρ to +0.95 (high correlation). Reset and run 200. The chain zig-zags slowly along the elongated diagonal. The L-shaped steps are very short — conditional SD is √(1 − 0.95²) ≈ 0.31. Many more iterations are needed for the chain to traverse the ellipse end-to-end. This is the CORRELATION TAX on Gibbs.
Drag ρ to 0.0 (independence). Reset, run 200. Now each x-update is independent of y, and vice versa. Steps are LARGE (conditional SD = 1). The target ellipse becomes a circle, and the chain covers it rapidly. Effective sample size per iteration is much higher.
Drag ρ to −0.95 (high negative correlation). Same zig-zag pathology but now along the OTHER diagonal. The pathology is about MAGNITUDE of correlation, not sign.
Step through one iteration at a time. Watch the L-shape unfold: starting at (x, y), the x-coordinate moves to a new value (the conditional sample given the current y), then the y-coordinate moves (the conditional sample given the new x). One Gibbs iteration = one sweep of K coordinate updates (here K = 2).
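The widget experiments above can be mimicked offline. The sketch below reruns the bivariate Normal sampler at the same four values of ρ and prints the conditional SD alongside the lag-1 autocorrelation of the x-chain; for this target the theoretical lag-1 value is ρ². The helper names are illustrative, and the chain length of 20000 is an arbitrary choice rather than the widget's setting.

```python
import math
import random

def gibbs_x_chain(rho, n_iter, seed=0):
    """Run the two-step Gibbs sampler and return only the x-coordinates."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    xs = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, cond_sd)  # x-update given current y
        y = rng.gauss(rho * x, cond_sd)  # y-update given new x
        xs.append(x)
    return xs

def lag1_autocorrelation(values):
    """Sample autocorrelation between consecutive entries of values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    cov = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return cov / var

for rho in (0.0, 0.7, 0.95, -0.95):
    acf1 = lag1_autocorrelation(gibbs_x_chain(rho, n_iter=20000))
    cond_sd = math.sqrt(1.0 - rho * rho)
    print(f"rho={rho:+.2f}  conditional SD={cond_sd:.2f}  lag-1 autocorrelation={acf1:.2f}")
```

The values at ρ = +0.95 and ρ = −0.95 should come out close to each other, matching the observation that the pathology depends on the magnitude of the correlation rather than its sign.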

A Bayesian model has parameters (μ, σ²) with closed-form conditionals μ | σ², D ~ Normal and σ² | μ, D ~ Inverse-Gamma. Implementing Gibbs is easy. But after burn-in the chain is moving SLOWLY: an iteration moves each of μ and σ² by only ~ 0.001 standard deviations. What might be going wrong?

What you now know

Gibbs is the Metropolis-Hastings sampler whose proposal is the full conditional: the acceptance ratio collapses to 1, and each update leaves the joint posterior invariant by detailed balance. Conditionals must be tractable (closed-form or efficient to sample from). Hierarchical and data-augmentation models are Gibbs' natural home; Albert-Chib (1993) probit augmentation is iconic. Highly correlated parameters slow Gibbs to a crawl; remedies are blocked Gibbs, re-parameterisation, or HMC. §7.5 develops HMC, the modern default for non-trivial high-dimensional Bayes.

References

Geman, S., Geman, D. (1984). "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Trans. PAMI 6(6), 721–741. (Introduces Gibbs.)
Gelfand, A.E., Smith, A.F.M. (1990). "Sampling-based approaches to calculating marginal densities." JASA 85(410), 398–409. (Brought Gibbs to mainstream Bayesian statistics.)
Tanner, M.A., Wong, W.H. (1987). "The calculation of posterior distributions by data augmentation." JASA 82(398), 528–540. (Data augmentation.)
Albert, J.H., Chib, S. (1993). "Bayesian analysis of binary and polychotomous response data." JASA 88(422), 669–679. (Probit augmentation.)
Plummer, M. (2003). "JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling." Proc. 3rd Intl. Workshop on Distributed Statistical Computing. (JAGS, the modern Gibbs sampler.)