Metropolis–Hastings by hand

Part 7 — Bayesian methods

Learning objectives

  • State the Metropolis acceptance ratio α = min(1, π(θ*) / π(θ)) for symmetric proposals
  • Extend to the Hastings ratio α = min(1, π(θ*) q(θ | θ*) / (π(θ) q(θ* | θ))) for asymmetric proposals
  • Justify detailed balance and how it implies the target as the stationary distribution
  • Diagnose acceptance-rate pathologies (chain stuck vs chain bouncing) and tune step size into the 20–50% acceptance window
  • Recognise common failure modes: mode trapping, slow mixing, autocorrelation

When the target posterior p(θD)p(\theta \mid D) is non-conjugate — logistic regression, mixture models, hierarchical models, anything where closed-form integration fails — you can't evaluate the posterior's normalising constant. But you usually CAN evaluate the unnormalised density π~(θ)=p(Dθ)p(θ)\tilde{\pi}(\theta) = p(D \mid \theta) , p(\theta) at any point. METROPOLIS-HASTINGS exploits exactly this: build a Markov chain whose stationary distribution is the target, by sampling proposals you can compute and accepting them with a specific probability.

The algorithm

Given current state θt\theta_t:

  • Sample a proposal θq(θt)\theta^* \sim q(\cdot \mid \theta_t) from a tractable PROPOSAL distribution (often N(θt,σ2)\mathcal{N}(\theta_t, \sigma^2), called the "random walk" Metropolis).
  • Compute the ACCEPTANCE RATIO α=min(1,π~(θ)q(θtθ)π~(θt)q(θθt)).\alpha = \min\left(1, , \frac{\tilde{\pi}(\theta^) , q(\theta_t \mid \theta^)}{\tilde{\pi}(\theta_t) , q(\theta^* \mid \theta_t)}\right). When q is symmetric (e.g. Gaussian random walk), q-terms cancel and α reduces to min(1,π~(θ)/π~(θt))\min(1, \tilde{\pi}(\theta^*) / \tilde{\pi}(\theta_t)).
  • Sample uU(0,1)u \sim U(0, 1). If uαu \le \alpha, accept: θt+1θ\theta_{t+1} \leftarrow \theta^*. Else, reject: θt+1θt\theta_{t+1} \leftarrow \theta_t.

Notice three things: (a) only the unnormalised density appears in the ratio — the normalising constant cancels; (b) you always accept moves to HIGHER-density regions; (c) moves to LOWER density are accepted with probability proportional to the density ratio.

The chain produces samples θ1,θ2,\theta_1, \theta_2, \ldots that, asymptotically, are distributed according to π. Standard practice: discard the first 10–50% as BURN-IN (the chain hasn't reached the stationary distribution yet), then use the remaining samples to estimate posterior summaries.

Why it works: detailed balance

A transition kernel K(θ,θ)K(\theta, \theta') has stationary distribution π if

π(θ)K(θ,θ)=π(θ)K(θ,θ)θ,θ.\pi(\theta) \, K(\theta, \theta') = \pi(\theta') \, K(\theta', \theta) \quad \forall \theta, \theta'.

This is the DETAILED BALANCE condition. The Metropolis transition kernel K(θ,θ)=q(θθ)α(θ,θ)K(\theta, \theta') = q(\theta' \mid \theta) , \alpha(\theta, \theta') for θθ\theta' \ne \theta satisfies detailed balance with π exactly because of how α is defined — this is the proof. From detailed balance, π is the stationary distribution. Some mild ergodicity conditions (the chain can reach every region in finite time) then ensure convergence.

Choosing the proposal: the step-size trade-off

Step size too SMALL → proposals are very close to current state → almost always accepted (high acceptance rate) → but the chain barely moves → slow exploration, high autocorrelation, may never escape one mode of a multi-modal posterior.

Step size too LARGE → proposals are far from current state → usually in low-density regions → mostly rejected → chain gets stuck on the same value repeatedly → also slow exploration.

SWEET SPOT: tune the step size so the acceptance rate is in the 20–50% range. Roberts, Gelman, & Gilks (1997) and Roberts & Rosenthal (2001) derived OPTIMAL acceptance rates under various conditions: roughly 44% for 1D, 35% for 2D, ~25% for 4D+, and 23.4% asymptotically (Roberts-Rosenthal limit). Tune the proposal scale by trial — typically with an adaptive scheme during burn-in.

Mode trapping

When the posterior is MULTI-MODAL (two or more well-separated modes), Metropolis with a small step size will tend to GET STUCK in one mode. The chain's autocorrelation grows; the empirical sample weighting between modes is wrong. Diagnostics: run MULTIPLE chains from different starting points and check whether they converge to similar histograms (the Gelman-Rubin R-hat statistic). If different chains explore different modes, you have a mode-trapping problem.

Solutions: parallel tempering (multiple chains at varying "temperatures"), reversible-jump MCMC, or specifically designed proposals (e.g., random-walk + occasional larger Lévy-flight jumps). For modern non-trivial models, Hamiltonian Monte Carlo (§7.5) is dramatically better at exploring multi-modal targets.

Hastings ratio for asymmetric proposals

When the proposal kernel qq is asymmetric — e.g., a proposal whose mean depends on the current state in a non-symmetric way (gradient-based Langevin proposals, importance sampling-like proposals) — the q-terms in the acceptance ratio do NOT cancel. The general HASTINGS acceptance ratio is

α=min(1,π~(θ)q(θtθ)π~(θt)q(θθt)).\alpha = \min\left(1, \, \frac{\tilde{\pi}(\theta^*) \, q(\theta_t \mid \theta^*)}{\tilde{\pi}(\theta_t) \, q(\theta^* \mid \theta_t)}\right).

This ratio guarantees detailed balance with π even for asymmetric q. Used heavily in adaptive MCMC, Metropolis-adjusted Langevin algorithm (MALA), and Hamiltonian Monte Carlo.

Diagnostics

  • Trace plots: plot θ_t against t. Look for the chain visiting all modes, no obvious drifts, stable variance.
  • Acceptance rate: 20–50% target window.
  • Effective sample size (ESS): due to autocorrelation, n MCMC samples are worth only ESS < n independent samples. Report ESS for the posterior summary; reject if ESS is too small for the reported posterior precision.
  • Gelman-Rubin R-hat: run multiple chains from over-dispersed starting points; compare within-chain to between-chain variance. R-hat near 1 indicates convergence; R-hat > 1.05 indicates problems.
  • Geweke z-score: compare early to late portions of a chain.

Metropolis Hastings DemoInteractive figure — enable JavaScript to interact.

Try it

  • Start with step σ = 0.8. Click "Run 200 iter" several times. Watch the chain visit both modes of the bimodal target. Acceptance rate sits in the green 20–50% range. The histogram (blue bars) gradually converges to the green target density.
  • Now drag step to 0.10 (very small). Reset chain. Run 200. The chain barely moves — high acceptance (~95%), but it's trapped near θ = −2.5 (the starting mode), never visiting the +2.0 mode. The histogram is wildly wrong. This is the mode-trapping pathology.
  • Now drag step to 4.0 (very large). Reset, run 200. Most proposals are rejected; the chain holds the same value many times. Acceptance < 5%. The chain is "stuck" but for a different reason: rejections, not gravity.
  • Tune step to 1.5 or so. Acceptance ≈ 25–35%. Run 1000 iterations (Run 200 ×5). The histogram now matches the target much better. The empirical fraction of samples in the left mode approaches the true 40%.
  • Step through one iteration at a time (button "Step 1 iteration"). Watch the proposal (grey or green dashed line). Trace the logic: proposed θ*, density at θ*, density at current θ, ratio, accept/reject decision. By the third or fourth step, the algorithm should feel mechanical.

Suppose you run Metropolis on a target with 95% acceptance rate after tuning. Should you make the step size LARGER or SMALLER, and why?

What you now know

Metropolis-Hastings turns "I can evaluate the unnormalised target" into "I can sample from the target". The acceptance ratio uses only density ratios; the normalising constant cancels. Detailed balance is the proof. Tune proposal step to 20–50% acceptance; diagnose with trace plots, acceptance rate, ESS, Gelman-Rubin R-hat. Mode trapping is the chief failure mode for multi-modal targets; §§7.4 (Gibbs) and 7.5 (HMC) are alternatives, and parallel tempering is a generic remedy. §7.4 next: when one coordinate's conditional posterior is conjugate or analytic, Gibbs sampling sidesteps the Metropolis acceptance question entirely.

References

  • Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E. (1953). "Equation of state calculations by fast computing machines." J. Chem. Phys. 21(6), 1087–1092. (The foundational paper.)
  • Hastings, W.K. (1970). "Monte Carlo sampling methods using Markov chains and their applications." Biometrika 57(1), 97–109. (The general Hastings ratio.)
  • Roberts, G.O., Gelman, A., Gilks, W.R. (1997). "Weak convergence and optimal scaling of random walk Metropolis algorithms." Annals of Applied Probability 7(1), 110–120.
  • Roberts, G.O., Rosenthal, J.S. (2001). "Optimal scaling for various Metropolis-Hastings algorithms." Statistical Science 16(4), 351–367. (The 0.234 result.)
  • Brooks, S., Gelman, A., Jones, G.L., Meng, X.L. (eds.) (2011). Handbook of Markov Chain Monte Carlo. CRC. (The comprehensive reference.)

This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.