Hamiltonian Monte Carlo intuition

Part 7 — Bayesian methods

Learning objectives

  • State HMC's central idea: introduce auxiliary momentum and evolve via Hamiltonian dynamics conserving H = U + K
  • Identify the LEAPFROG integrator and its symplectic, time-reversible properties
  • Write the HMC Metropolis acceptance ratio that corrects for leapfrog discretisation error
  • Recognise NUTS (Hoffman & Gelman 2014) as the auto-tuning HMC variant powering Stan / PyMC / NumPyro
  • Diagnose HMC pathologies: divergent transitions, low E-BFMI, funnel geometry

Random-walk Metropolis (§7.3) takes axis-blind diffusive steps; Gibbs (§7.4) takes axis-aligned conditional steps. Both struggle on POSTERIORS WITH STRONG CORRELATION and HIGH DIMENSION because they explore by random walk — covering distance n\sqrt{n} in n steps. HAMILTONIAN MONTE CARLO (Duane et al. 1987, brought to statistics by Neal 1996/2011) uses GRADIENT INFORMATION of the log-posterior to take coherent, long-range moves that follow the posterior's geometry. The result: orders of magnitude better mixing on the hard problems that matter.

The physics analogy

HMC treats Bayesian sampling as a problem in classical mechanics:

  • Parameters θ\theta play the role of POSITION.
  • An auxiliary MOMENTUM variable pp is introduced (independent of θ), with pN(0,M)p \sim \mathcal{N}(0, M) where MM is a "mass matrix" (usually the identity initially).
  • The negative log-posterior U(θ)=logπ(θ)U(\theta) = -\log \pi(\theta) is the POTENTIAL ENERGY.
  • The KINETIC ENERGY is K(p)=12pTM1pK(p) = \tfrac{1}{2} p^T M^{-1} p.
  • The HAMILTONIAN H(θ,p)=U(θ)+K(p)H(\theta, p) = U(\theta) + K(p) governs the dynamics.

The joint (θ,p)(\theta, p) distribution is exp(H)=exp(U)exp(K)=π(θ)N(p0,M)\propto \exp(-H) = \exp(-U) \exp(-K) = \pi(\theta) , \mathcal{N}(p \mid 0, M). Marginalising out p recovers π(θ). Hamilton's equations:

dθdt=M1p,dpdt=θU(θ)=θlogπ(θ).\frac{d\theta}{dt} = M^{-1} p, \quad \frac{dp}{dt} = -\nabla_\theta U(\theta) = \nabla_\theta \log \pi(\theta).

Under EXACT Hamiltonian dynamics, H is CONSERVED (the trajectory moves on a level set of H). A trajectory that starts at (θ0,p0)(\theta_0, p_0) and evolves for time TT reaches some (θT,pT)(\theta_T, p_T) with the SAME H. Crucially, the trajectory traces a path that follows the local GEOMETRY of the posterior — naturally moving along contours of equal density.

The leapfrog integrator

Hamilton's equations have no closed-form solution in general; we discretise. The LEAPFROG INTEGRATOR steps in three sub-steps of size ϵ\epsilon:

  • Half-step in momentum: ppϵ2U(θ)p \leftarrow p - \tfrac{\epsilon}{2} \nabla U(\theta)
  • Full step in position: θθ+ϵM1p\theta \leftarrow \theta + \epsilon M^{-1} p
  • Half-step in momentum: ppϵ2U(θ)p \leftarrow p - \tfrac{\epsilon}{2} \nabla U(\theta)

Iterate this L times to advance the trajectory by T=LϵT = L \epsilon. Leapfrog is

  • Symplectic: preserves the canonical volume in (θ,p)(\theta, p) space exactly.
  • Time-reversible: reversing the momentum at the end and running L more steps returns exactly to (θ0,p0)(\theta_0, -p_0).
  • 2nd-order accurate: error per step is O(ϵ3)O(\epsilon^3); total error after T/ε steps is O(ϵ2)O(\epsilon^2).

These properties are essential for HMC's correctness — they ensure the proposal distribution is symmetric and the trajectory stays close to the true Hamiltonian flow.

The HMC iteration

  • Sample new momentum p0N(0,M)p_0 \sim \mathcal{N}(0, M) (this resets the kinetic energy each iteration).
  • Compute H0=U(θ0)+K(p0)H_0 = U(\theta_0) + K(p_0).
  • Simulate L leapfrog steps to reach (θ,p)(\theta^, p^).
  • Negate the momentum (to enforce time-reversibility): ppp^* \leftarrow -p^*. (For the acceptance ratio it doesn't matter; the kinetic energy depends only on |p|.)
  • Compute H1=U(θ)+K(p)H_1 = U(\theta^) + K(p^).
  • Accept θ\theta^ with probability min(1,exp(H0H1))\min(1, \exp(H_0 - H_1)). If accepted, set θθ\theta \leftarrow \theta^. Else reject and stay at the old θ.

Under exact Hamiltonian dynamics, H0=H1H_0 = H_1 and the acceptance probability is 1. Leapfrog introduces discretisation error of size O(ϵ2)O(\epsilon^2); the Metropolis-Hastings correction recovers detailed balance with exp(H)=π(θ)N(p0,M)\exp(-H) = \pi(\theta) \mathcal{N}(p \mid 0, M). The marginal on θ is exactly π.

The benefit: trajectories follow geometry

Because leapfrog uses the GRADIENT of log π, each step is biased toward higher-density regions while the momentum maintains long-range coherence. The trajectory looks like a smooth arc along the contours of the posterior — exactly what you want for an elongated, correlated target. Random-walk Metropolis on the same target zig-zags in tiny steps, taking O(d2/ρ2)O(d^2 / \rho^2) steps to traverse a length where HMC takes O(1) trajectories.

The trade-off: each HMC iteration requires L gradient evaluations of log π. For a typical L = 30, that's 30× the per-iteration cost of one Metropolis step — but the effective sample size per iteration is often 100–1000× higher on correlated targets.

Tuning ε and L

ε too small → trajectory barely moves → small effective step size → wasted gradient evaluations. ε too large → leapfrog error blows up → low acceptance rate → wasted compute. L too small → momentum is too quickly reset → behaviour reverts to random walk. L too large → trajectory loops back on itself (U-turn) → wasted compute.

The MODERN auto-tuning algorithm: NUTS (Hoffman & Gelman 2014). NUTS doubles the trajectory length adaptively until detecting a U-turn (where the trajectory starts curving back on itself); the trajectory is then truncated. Combined with dual-averaging adaptation of ε during burn-in, NUTS makes HMC nearly tuning-free. Stan, PyMC, and NumPyro all default to NUTS.

Pathologies and diagnostics

  • Divergent transitions: when leapfrog error explodes (often in regions of extreme curvature), the chain's trajectory goes to infinity. Stan reports divergences as warnings. Many divergences = serious problem with the parameterisation (e.g., funnel geometry in hierarchical models). Re-parameterise (e.g., Matthew Hoffman's non-centred parameterisation for hierarchical models).
  • E-BFMI (Energy-Bayesian Fraction of Missing Information): a diagnostic for whether the momentum-resampling step is mixing well across energy levels. E-BFMI < 0.3 indicates problems.
  • Funnel geometry: hierarchical models with prior θσN(0,σ2)\theta \mid \sigma \sim \mathcal{N}(0, \sigma^2) and σsmall range\sigma \sim \text{small range} create a funnel that HMC struggles with at the narrow neck. Non-centred parameterisation (sampling θ~=θ/σ\tilde{\theta} = \theta/\sigma) flattens the funnel.

Hmc Intuition DemoInteractive figure — enable JavaScript to interact.

Try it

  • Start with ρ = 0.92, ε = 0.08, L = 30 (the defaults). Click "1 HMC iter" several times. Watch the green arc — the leapfrog trajectory. Each iteration draws a fresh momentum and proposes the trajectory's endpoint. The trajectory FOLLOWS the contour ellipse rather than ignoring it.
  • Click "Run 100 HMC iter" once. Then click "Run 100 MH iter". Both samplers have had 100 iterations on the same target. Look at the cloud of points: HMC's blue dots are spread across the ellipse; MH's red dots are clustered near the starting point.
  • Compare the empirical correlations in the readout. HMC's empirical correlation should approach the true ρ = 0.92 within 100 iterations; MH's typically requires many more iterations.
  • Drag ε to 0.30 (too big). Run 100 HMC iter. Acceptance rate drops dramatically — leapfrog error has blown up. Drag ε back to 0.01 (too small) and rerun: acceptance is high (~99%) but the trajectory barely moves.
  • Drag L to 5. Run 100. Trajectories are short. Performance is closer to random-walk. Drag L to 80. Trajectories may loop back (U-turn waste). The Hoffman-Gelman NUTS algorithm picks an L for you per iteration that avoids both pathologies.

A Stan model reports 50 divergent transitions out of 4000 post-warmup iterations. The user's first response should be:

What you now know

HMC reformulates Bayesian sampling as Hamiltonian dynamics: position θ + momentum p, kinetic + potential energy, leapfrog integrator. Trajectories follow the geometry; Metropolis correction maintains exactness. NUTS (Hoffman-Gelman 2014) auto-tunes ε and L, making HMC the default sampler for nontrivial Bayesian models in Stan / PyMC / NumPyro. Divergent transitions and funnel geometry are the chief pathologies; non-centred parameterisation is the workhorse remedy. §§7.6 (posterior predictive checks) and 7.7 (Bayes factors / WAIC) develop the diagnostic and comparison tools that wrap modern Bayesian inference.

References

  • Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D. (1987). "Hybrid Monte Carlo." Physics Letters B 195(2), 216–222. (The original physics paper.)
  • Neal, R.M. (2011). "MCMC using Hamiltonian dynamics." In Handbook of Markov Chain Monte Carlo, Brooks et al. (eds.), pp. 113–162. CRC. (The standard reference; brought HMC to statistics.)
  • Hoffman, M.D., Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo." JMLR 15(1), 1593–1623. (NUTS.)
  • Betancourt, M. (2017). "A conceptual introduction to Hamiltonian Monte Carlo." arXiv:1701.02434. (The clearest pedagogical treatment.)
  • Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., Riddell, A. (2017). "Stan: A probabilistic programming language." J. Statistical Software 76(1). (Stan, the HMC workhorse.)

This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.