Hamiltonian Monte Carlo intuition

Part 7 — Bayesian methods

Learning objectives

State HMC's central idea: introduce auxiliary momentum and evolve via Hamiltonian dynamics conserving H = U + K
Identify the LEAPFROG integrator and its symplectic, time-reversible properties
Write the HMC Metropolis acceptance ratio that corrects for leapfrog discretisation error
Recognise NUTS (Hoffman & Gelman 2014) as the auto-tuning HMC variant powering Stan / PyMC / NumPyro
Diagnose HMC pathologies: divergent transitions, low E-BFMI, funnel geometry

Random-walk Metropolis (§7.3) takes axis-blind diffusive steps; Gibbs (§7.4) takes axis-aligned conditional steps. Both struggle on POSTERIORS WITH STRONG CORRELATION and HIGH DIMENSION because they explore by random walk — covering distance $\sqrt{n}$ in n steps. HAMILTONIAN MONTE CARLO (Duane et al. 1987, brought to statistics by Neal 1996/2011) uses GRADIENT INFORMATION of the log-posterior to take coherent, long-range moves that follow the posterior's geometry. The result: orders of magnitude better mixing on the hard problems that matter.

The physics analogy

HMC treats Bayesian sampling as a problem in classical mechanics:

Parameters $\theta$ play the role of POSITION.
An auxiliary MOMENTUM variable $p$ is introduced (independent of θ), with $p \sim \mathcal{N}(0, M)$ where $M$ is a "mass matrix" (usually the identity initially).
The negative log-posterior $U(\theta) = -\log \pi(\theta)$ is the POTENTIAL ENERGY.
The KINETIC ENERGY is $K(p) = \tfrac{1}{2} p^T M^{-1} p$ .
The HAMILTONIAN $H(\theta, p) = U(\theta) + K(p)$ governs the dynamics.

The joint $(\theta, p)$ distribution is $\propto \exp(-H) = \exp(-U) \exp(-K) = \pi(\theta) , \mathcal{N}(p \mid 0, M)$ . Marginalising out p recovers π(θ). Hamilton's equations:

\frac{d\theta}{dt} = M^{-1} p, \quad \frac{dp}{dt} = -\nabla_\theta U(\theta) = \nabla_\theta \log \pi(\theta).

Under EXACT Hamiltonian dynamics, H is CONSERVED (the trajectory moves on a level set of H). A trajectory that starts at $(\theta_0, p_0)$ and evolves for time $T$ reaches some $(\theta_T, p_T)$ with the SAME H. Crucially, the trajectory traces a path that follows the local GEOMETRY of the posterior — naturally moving along contours of equal density.

The leapfrog integrator

Hamilton's equations have no closed-form solution in general; we discretise. The LEAPFROG INTEGRATOR steps in three sub-steps of size $\epsilon$ :

Half-step in momentum: $p \leftarrow p - \tfrac{\epsilon}{2} \nabla U(\theta)$
Full step in position: $\theta \leftarrow \theta + \epsilon M^{-1} p$
Half-step in momentum: $p \leftarrow p - \tfrac{\epsilon}{2} \nabla U(\theta)$

Iterate this L times to advance the trajectory by $T = L \epsilon$ . Leapfrog is

Symplectic: preserves the canonical volume in $(\theta, p)$ space exactly.
Time-reversible: reversing the momentum at the end and running L more steps returns exactly to $(\theta_0, -p_0)$ .
2nd-order accurate: error per step is $O(\epsilon^3)$ ; total error after T/ε steps is $O(\epsilon^2)$ .

These properties are essential for HMC's correctness — they ensure the proposal distribution is symmetric and the trajectory stays close to the true Hamiltonian flow.

The HMC iteration

Sample new momentum $p_0 \sim \mathcal{N}(0, M)$ (this resets the kinetic energy each iteration).
Compute $H_0 = U(\theta_0) + K(p_0)$ .
Simulate L leapfrog steps to reach $(\theta^$ .
Negate the momentum (to enforce time-reversibility): $p^* \leftarrow -p^*$ . (For the acceptance ratio it doesn't matter; the kinetic energy depends only on |p|.)
Compute $H_1 = U(\theta^$ .
Accept $\theta^$ with probability $\min(1, \exp(H_0 - H_1))$ . If accepted, set $\theta \leftarrow \theta^$ $θ \leftarrow θ^{*}$ . Else reject and stay at the old θ.

Under exact Hamiltonian dynamics, $H_0 = H_1$ and the acceptance probability is 1. Leapfrog introduces discretisation error of size $O(\epsilon^2)$ ; the Metropolis-Hastings correction recovers detailed balance with $\exp(-H) = \pi(\theta) \mathcal{N}(p \mid 0, M)$ . The marginal on θ is exactly π.

The benefit: trajectories follow geometry

Because leapfrog uses the GRADIENT of log π, each step is biased toward higher-density regions while the momentum maintains long-range coherence. The trajectory looks like a smooth arc along the contours of the posterior — exactly what you want for an elongated, correlated target. Random-walk Metropolis on the same target zig-zags in tiny steps, taking $O(d^2 / \rho^2)$ steps to traverse a length where HMC takes O(1) trajectories.

The trade-off: each HMC iteration requires L gradient evaluations of log π. For a typical L = 30, that's 30× the per-iteration cost of one Metropolis step — but the effective sample size per iteration is often 100–1000× higher on correlated targets.

Tuning ε and L

ε too small → trajectory barely moves → small effective step size → wasted gradient evaluations. ε too large → leapfrog error blows up → low acceptance rate → wasted compute. L too small → momentum is too quickly reset → behaviour reverts to random walk. L too large → trajectory loops back on itself (U-turn) → wasted compute.

The MODERN auto-tuning algorithm: NUTS (Hoffman & Gelman 2014). NUTS doubles the trajectory length adaptively until detecting a U-turn (where the trajectory starts curving back on itself); the trajectory is then truncated. Combined with dual-averaging adaptation of ε during burn-in, NUTS makes HMC nearly tuning-free. Stan, PyMC, and NumPyro all default to NUTS.

Pathologies and diagnostics

Divergent transitions: when leapfrog error explodes (often in regions of extreme curvature), the chain's trajectory goes to infinity. Stan reports divergences as warnings. Many divergences = serious problem with the parameterisation (e.g., funnel geometry in hierarchical models). Re-parameterise (e.g., Matthew Hoffman's non-centred parameterisation for hierarchical models).
E-BFMI (Energy-Bayesian Fraction of Missing Information): a diagnostic for whether the momentum-resampling step is mixing well across energy levels. E-BFMI < 0.3 indicates problems.
Funnel geometry: hierarchical models with prior $\theta \mid \sigma \sim \mathcal{N}(0, \sigma^2)$ and $\sigma \sim \text{small range}$ create a funnel that HMC struggles with at the narrow neck. Non-centred parameterisation (sampling $\tilde{\theta} = \theta/\sigma$ ) flattens the funnel.

Try it

Start with ρ = 0.92, ε = 0.08, L = 30 (the defaults). Click "1 HMC iter" several times. Watch the green arc — the leapfrog trajectory. Each iteration draws a fresh momentum and proposes the trajectory's endpoint. The trajectory FOLLOWS the contour ellipse rather than ignoring it.
Click "Run 100 HMC iter" once. Then click "Run 100 MH iter". Both samplers have had 100 iterations on the same target. Look at the cloud of points: HMC's blue dots are spread across the ellipse; MH's red dots are clustered near the starting point.
Compare the empirical correlations in the readout. HMC's empirical correlation should approach the true ρ = 0.92 within 100 iterations; MH's typically requires many more iterations.
Drag ε to 0.30 (too big). Run 100 HMC iter. Acceptance rate drops dramatically — leapfrog error has blown up. Drag ε back to 0.01 (too small) and rerun: acceptance is high (~99%) but the trajectory barely moves.
Drag L to 5. Run 100. Trajectories are short. Performance is closer to random-walk. Drag L to 80. Trajectories may loop back (U-turn waste). The Hoffman-Gelman NUTS algorithm picks an L for you per iteration that avoids both pathologies.

A Stan model reports 50 divergent transitions out of 4000 post-warmup iterations. The user's first response should be:

What you now know

HMC reformulates Bayesian sampling as Hamiltonian dynamics: position θ + momentum p, kinetic + potential energy, leapfrog integrator. Trajectories follow the geometry; Metropolis correction maintains exactness. NUTS (Hoffman-Gelman 2014) auto-tunes ε and L, making HMC the default sampler for nontrivial Bayesian models in Stan / PyMC / NumPyro. Divergent transitions and funnel geometry are the chief pathologies; non-centred parameterisation is the workhorse remedy. §§7.6 (posterior predictive checks) and 7.7 (Bayes factors / WAIC) develop the diagnostic and comparison tools that wrap modern Bayesian inference.

References

Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D. (1987). "Hybrid Monte Carlo." Physics Letters B 195(2), 216–222. (The original physics paper.)
Neal, R.M. (2011). "MCMC using Hamiltonian dynamics." In Handbook of Markov Chain Monte Carlo, Brooks et al. (eds.), pp. 113–162. CRC. (The standard reference; brought HMC to statistics.)
Hoffman, M.D., Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo." JMLR 15(1), 1593–1623. (NUTS.)
Betancourt, M. (2017). "A conceptual introduction to Hamiltonian Monte Carlo." arXiv:1701.02434. (The clearest pedagogical treatment.)
Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., Riddell, A. (2017). "Stan: A probabilistic programming language." J. Statistical Software 76(1). (Stan, the HMC workhorse.)