Metropolis–Hastings by hand
Learning objectives
- State the Metropolis acceptance ratio α = min(1, π(θ*) / π(θ)) for symmetric proposals
- Extend to the Hastings ratio α = min(1, π(θ*) q(θ | θ*) / (π(θ) q(θ* | θ))) for asymmetric proposals
- Justify detailed balance and how it implies the target as the stationary distribution
- Diagnose acceptance-rate pathologies (chain stuck vs chain bouncing) and tune step size into the 20–50% acceptance window
- Recognise common failure modes: mode trapping, slow mixing, autocorrelation
When the target posterior is non-conjugate — logistic regression, mixture models, hierarchical models, anything where closed-form integration fails — you can't evaluate the posterior's normalising constant. But you usually CAN evaluate the unnormalised density at any point. METROPOLIS-HASTINGS exploits exactly this: build a Markov chain whose stationary distribution is the target, by sampling proposals you can compute and accepting them with a specific probability.
The algorithm
Given current state :
- Sample a proposal from a tractable PROPOSAL distribution (often , called the "random walk" Metropolis).
- Compute the ACCEPTANCE RATIO When q is symmetric (e.g. Gaussian random walk), q-terms cancel and α reduces to .
- Sample . If , accept: . Else, reject: .
Notice three things: (a) only the unnormalised density appears in the ratio — the normalising constant cancels; (b) you always accept moves to HIGHER-density regions; (c) moves to LOWER density are accepted with probability proportional to the density ratio.
The chain produces samples that, asymptotically, are distributed according to π. Standard practice: discard the first 10–50% as BURN-IN (the chain hasn't reached the stationary distribution yet), then use the remaining samples to estimate posterior summaries.
Why it works: detailed balance
A transition kernel has stationary distribution π if
This is the DETAILED BALANCE condition. The Metropolis transition kernel for satisfies detailed balance with π exactly because of how α is defined — this is the proof. From detailed balance, π is the stationary distribution. Some mild ergodicity conditions (the chain can reach every region in finite time) then ensure convergence.
Choosing the proposal: the step-size trade-off
Step size too SMALL → proposals are very close to current state → almost always accepted (high acceptance rate) → but the chain barely moves → slow exploration, high autocorrelation, may never escape one mode of a multi-modal posterior.
Step size too LARGE → proposals are far from current state → usually in low-density regions → mostly rejected → chain gets stuck on the same value repeatedly → also slow exploration.
SWEET SPOT: tune the step size so the acceptance rate is in the 20–50% range. Roberts, Gelman, & Gilks (1997) and Roberts & Rosenthal (2001) derived OPTIMAL acceptance rates under various conditions: roughly 44% for 1D, 35% for 2D, ~25% for 4D+, and 23.4% asymptotically (Roberts-Rosenthal limit). Tune the proposal scale by trial — typically with an adaptive scheme during burn-in.
Mode trapping
When the posterior is MULTI-MODAL (two or more well-separated modes), Metropolis with a small step size will tend to GET STUCK in one mode. The chain's autocorrelation grows; the empirical sample weighting between modes is wrong. Diagnostics: run MULTIPLE chains from different starting points and check whether they converge to similar histograms (the Gelman-Rubin R-hat statistic). If different chains explore different modes, you have a mode-trapping problem.
Solutions: parallel tempering (multiple chains at varying "temperatures"), reversible-jump MCMC, or specifically designed proposals (e.g., random-walk + occasional larger Lévy-flight jumps). For modern non-trivial models, Hamiltonian Monte Carlo (§7.5) is dramatically better at exploring multi-modal targets.
Hastings ratio for asymmetric proposals
When the proposal kernel is asymmetric — e.g., a proposal whose mean depends on the current state in a non-symmetric way (gradient-based Langevin proposals, importance sampling-like proposals) — the q-terms in the acceptance ratio do NOT cancel. The general HASTINGS acceptance ratio is
This ratio guarantees detailed balance with π even for asymmetric q. Used heavily in adaptive MCMC, Metropolis-adjusted Langevin algorithm (MALA), and Hamiltonian Monte Carlo.
Diagnostics
- Trace plots: plot θ_t against t. Look for the chain visiting all modes, no obvious drifts, stable variance.
- Acceptance rate: 20–50% target window.
- Effective sample size (ESS): due to autocorrelation, n MCMC samples are worth only ESS < n independent samples. Report ESS for the posterior summary; reject if ESS is too small for the reported posterior precision.
- Gelman-Rubin R-hat: run multiple chains from over-dispersed starting points; compare within-chain to between-chain variance. R-hat near 1 indicates convergence; R-hat > 1.05 indicates problems.
- Geweke z-score: compare early to late portions of a chain.
Try it
- Start with step σ = 0.8. Click "Run 200 iter" several times. Watch the chain visit both modes of the bimodal target. Acceptance rate sits in the green 20–50% range. The histogram (blue bars) gradually converges to the green target density.
- Now drag step to 0.10 (very small). Reset chain. Run 200. The chain barely moves — high acceptance (~95%), but it's trapped near θ = −2.5 (the starting mode), never visiting the +2.0 mode. The histogram is wildly wrong. This is the mode-trapping pathology.
- Now drag step to 4.0 (very large). Reset, run 200. Most proposals are rejected; the chain holds the same value many times. Acceptance < 5%. The chain is "stuck" but for a different reason: rejections, not gravity.
- Tune step to 1.5 or so. Acceptance ≈ 25–35%. Run 1000 iterations (Run 200 ×5). The histogram now matches the target much better. The empirical fraction of samples in the left mode approaches the true 40%.
- Step through one iteration at a time (button "Step 1 iteration"). Watch the proposal (grey or green dashed line). Trace the logic: proposed θ*, density at θ*, density at current θ, ratio, accept/reject decision. By the third or fourth step, the algorithm should feel mechanical.
Suppose you run Metropolis on a target with 95% acceptance rate after tuning. Should you make the step size LARGER or SMALLER, and why?
What you now know
Metropolis-Hastings turns "I can evaluate the unnormalised target" into "I can sample from the target". The acceptance ratio uses only density ratios; the normalising constant cancels. Detailed balance is the proof. Tune proposal step to 20–50% acceptance; diagnose with trace plots, acceptance rate, ESS, Gelman-Rubin R-hat. Mode trapping is the chief failure mode for multi-modal targets; §§7.4 (Gibbs) and 7.5 (HMC) are alternatives, and parallel tempering is a generic remedy. §7.4 next: when one coordinate's conditional posterior is conjugate or analytic, Gibbs sampling sidesteps the Metropolis acceptance question entirely.
References
- Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E. (1953). "Equation of state calculations by fast computing machines." J. Chem. Phys. 21(6), 1087–1092. (The foundational paper.)
- Hastings, W.K. (1970). "Monte Carlo sampling methods using Markov chains and their applications." Biometrika 57(1), 97–109. (The general Hastings ratio.)
- Roberts, G.O., Gelman, A., Gilks, W.R. (1997). "Weak convergence and optimal scaling of random walk Metropolis algorithms." Annals of Applied Probability 7(1), 110–120.
- Roberts, G.O., Rosenthal, J.S. (2001). "Optimal scaling for various Metropolis-Hastings algorithms." Statistical Science 16(4), 351–367. (The 0.234 result.)
- Brooks, S., Gelman, A., Jones, G.L., Meng, X.L. (eds.) (2011). Handbook of Markov Chain Monte Carlo. CRC. (The comprehensive reference.)