Metropolis–Hastings by hand

Part 7 — Bayesian methods

Learning objectives

State the Metropolis acceptance ratio α = min(1, π(θ*) / π(θ)) for symmetric proposals
Extend to the Hastings ratio α = min(1, π(θ*) q(θ | θ*) / (π(θ) q(θ* | θ))) for asymmetric proposals
Justify detailed balance and how it implies the target as the stationary distribution
Diagnose acceptance-rate pathologies (chain stuck vs chain bouncing) and tune step size into the 20–50% acceptance window
Recognise common failure modes: mode trapping, slow mixing, autocorrelation

When the target posterior $p(\theta \mid D)$ is non-conjugate — logistic regression, mixture models, hierarchical models, anything where closed-form integration fails — you can't evaluate the posterior's normalising constant. But you usually CAN evaluate the unnormalised density $\tilde{\pi}(\theta) = p(D \mid \theta) , p(\theta)$ at any point. METROPOLIS-HASTINGS exploits exactly this: build a Markov chain whose stationary distribution is the target, by sampling proposals you can compute and accepting them with a specific probability.

The algorithm

Given current state $\theta_t$ :

Sample a proposal $\theta^* \sim q(\cdot \mid \theta_t)$ from a tractable PROPOSAL distribution (often $\mathcal{N}(\theta_t, \sigma^2)$ , called the "random walk" Metropolis).
Compute the ACCEPTANCE RATIO $\alpha = \min\left(1, , \frac{\tilde{\pi}(\theta^$ When q is symmetric (e.g. Gaussian random walk), q-terms cancel and α reduces to $\min(1, \tilde{\pi}(\theta^*) / \tilde{\pi}(\theta_t))$ .
Sample $u \sim U(0, 1)$ . If $u \le \alpha$ , accept: $\theta_{t+1} \leftarrow \theta^*$ . Else, reject: $\theta_{t+1} \leftarrow \theta_t$ .

Notice three things: (a) only the unnormalised density appears in the ratio — the normalising constant cancels; (b) you always accept moves to HIGHER-density regions; (c) moves to LOWER density are accepted with probability proportional to the density ratio.

The chain produces samples $\theta_1, \theta_2, \ldots$ that, asymptotically, are distributed according to π. Standard practice: discard the first 10–50% as BURN-IN (the chain hasn't reached the stationary distribution yet), then use the remaining samples to estimate posterior summaries.

Why it works: detailed balance

A transition kernel $K(\theta, \theta')$ has stationary distribution π if

\pi(\theta) \, K(\theta, \theta') = \pi(\theta') \, K(\theta', \theta) \quad \forall \theta, \theta'.

This is the DETAILED BALANCE condition. The Metropolis transition kernel $K(\theta, \theta') = q(\theta' \mid \theta) , \alpha(\theta, \theta')$ for $\theta' \ne \theta$ satisfies detailed balance with π exactly because of how α is defined — this is the proof. From detailed balance, π is the stationary distribution. Some mild ergodicity conditions (the chain can reach every region in finite time) then ensure convergence.

Choosing the proposal: the step-size trade-off

Step size too SMALL → proposals are very close to current state → almost always accepted (high acceptance rate) → but the chain barely moves → slow exploration, high autocorrelation, may never escape one mode of a multi-modal posterior.

Step size too LARGE → proposals are far from current state → usually in low-density regions → mostly rejected → chain gets stuck on the same value repeatedly → also slow exploration.

SWEET SPOT: tune the step size so the acceptance rate is in the 20–50% range. Roberts, Gelman, & Gilks (1997) and Roberts & Rosenthal (2001) derived OPTIMAL acceptance rates under various conditions: roughly 44% for 1D, 35% for 2D, ~25% for 4D+, and 23.4% asymptotically (Roberts-Rosenthal limit). Tune the proposal scale by trial — typically with an adaptive scheme during burn-in.

Mode trapping

When the posterior is MULTI-MODAL (two or more well-separated modes), Metropolis with a small step size will tend to GET STUCK in one mode. The chain's autocorrelation grows; the empirical sample weighting between modes is wrong. Diagnostics: run MULTIPLE chains from different starting points and check whether they converge to similar histograms (the Gelman-Rubin R-hat statistic). If different chains explore different modes, you have a mode-trapping problem.

Solutions: parallel tempering (multiple chains at varying "temperatures"), reversible-jump MCMC, or specifically designed proposals (e.g., random-walk + occasional larger Lévy-flight jumps). For modern non-trivial models, Hamiltonian Monte Carlo (§7.5) is dramatically better at exploring multi-modal targets.

Hastings ratio for asymmetric proposals

When the proposal kernel $q$ is asymmetric — e.g., a proposal whose mean depends on the current state in a non-symmetric way (gradient-based Langevin proposals, importance sampling-like proposals) — the q-terms in the acceptance ratio do NOT cancel. The general HASTINGS acceptance ratio is

\alpha = \min\left(1, \, \frac{\tilde{\pi}(\theta^*) \, q(\theta_t \mid \theta^*)}{\tilde{\pi}(\theta_t) \, q(\theta^* \mid \theta_t)}\right).

This ratio guarantees detailed balance with π even for asymmetric q. Used heavily in adaptive MCMC, Metropolis-adjusted Langevin algorithm (MALA), and Hamiltonian Monte Carlo.

Diagnostics

Trace plots: plot θ_t against t. Look for the chain visiting all modes, no obvious drifts, stable variance.
Acceptance rate: 20–50% target window.
Effective sample size (ESS): due to autocorrelation, n MCMC samples are worth only ESS < n independent samples. Report ESS for the posterior summary; reject if ESS is too small for the reported posterior precision.
Gelman-Rubin R-hat: run multiple chains from over-dispersed starting points; compare within-chain to between-chain variance. R-hat near 1 indicates convergence; R-hat > 1.05 indicates problems.
Geweke z-score: compare early to late portions of a chain.

Try it

Start with step σ = 0.8. Click "Run 200 iter" several times. Watch the chain visit both modes of the bimodal target. Acceptance rate sits in the green 20–50% range. The histogram (blue bars) gradually converges to the green target density.
Now drag step to 0.10 (very small). Reset chain. Run 200. The chain barely moves — high acceptance (~95%), but it's trapped near θ = −2.5 (the starting mode), never visiting the +2.0 mode. The histogram is wildly wrong. This is the mode-trapping pathology.
Now drag step to 4.0 (very large). Reset, run 200. Most proposals are rejected; the chain holds the same value many times. Acceptance < 5%. The chain is "stuck" but for a different reason: rejections, not gravity.
Tune step to 1.5 or so. Acceptance ≈ 25–35%. Run 1000 iterations (Run 200 ×5). The histogram now matches the target much better. The empirical fraction of samples in the left mode approaches the true 40%.
Step through one iteration at a time (button "Step 1 iteration"). Watch the proposal (grey or green dashed line). Trace the logic: proposed θ*, density at θ*, density at current θ, ratio, accept/reject decision. By the third or fourth step, the algorithm should feel mechanical.

Suppose you run Metropolis on a target with 95% acceptance rate after tuning. Should you make the step size LARGER or SMALLER, and why?

What you now know

Metropolis-Hastings turns "I can evaluate the unnormalised target" into "I can sample from the target". The acceptance ratio uses only density ratios; the normalising constant cancels. Detailed balance is the proof. Tune proposal step to 20–50% acceptance; diagnose with trace plots, acceptance rate, ESS, Gelman-Rubin R-hat. Mode trapping is the chief failure mode for multi-modal targets; §§7.4 (Gibbs) and 7.5 (HMC) are alternatives, and parallel tempering is a generic remedy. §7.4 next: when one coordinate's conditional posterior is conjugate or analytic, Gibbs sampling sidesteps the Metropolis acceptance question entirely.

References

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E. (1953). "Equation of state calculations by fast computing machines." J. Chem. Phys. 21(6), 1087–1092. (The foundational paper.)
Hastings, W.K. (1970). "Monte Carlo sampling methods using Markov chains and their applications." Biometrika 57(1), 97–109. (The general Hastings ratio.)
Roberts, G.O., Gelman, A., Gilks, W.R. (1997). "Weak convergence and optimal scaling of random walk Metropolis algorithms." Annals of Applied Probability 7(1), 110–120.
Roberts, G.O., Rosenthal, J.S. (2001). "Optimal scaling for various Metropolis-Hastings algorithms." Statistical Science 16(4), 351–367. (The 0.234 result.)
Brooks, S., Gelman, A., Jones, G.L., Meng, X.L. (eds.) (2011). Handbook of Markov Chain Monte Carlo. CRC. (The comprehensive reference.)