Poisson and negative-binomial counts
Learning objectives
- Fit Poisson regression with log link for count data
- Interpret coefficients as multiplicative effects on the count rate
- Diagnose overdispersion (Var(Y|X) > E[Y|X])
- Switch to negative-binomial when overdispersion is detected
- Apply offsets for exposure-adjusted rate models
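Count outcomes (number of events in a time interval, number of accidents per region, ER admissions per day) are not Normal: they are non-negative integers, usually right-skewed, with spread that grows with the mean. Poisson regression is the baseline model for them.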
The Poisson model
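Each count is modeled as conditionally Poisson with a log-linear mean:

$$Y_i \mid x_i \sim \text{Poisson}(\mu_i), \qquad \log \mu_i = x_i^\top \beta$$

The log link is canonical for Poisson: $\mu_i = \exp(x_i^\top \beta)$ is always positive. The variance is constrained: $\text{Var}(Y_i \mid x_i) = \mathbb{E}[Y_i \mid x_i] = \mu_i$, so the variance equals the mean.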
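To make the fitting step concrete, here is a minimal sketch of Newton-Raphson for a Poisson regression with one covariate (for the canonical log link this coincides with IRLS). The data and the function name `fit_poisson` are made up for illustration; in practice you would use a GLM routine rather than hand-rolled code.

```python
import math

def fit_poisson(x, y, n_iter=25, tol=1e-10):
    """Newton-Raphson for the model log(mu_i) = b0 + b1 * x_i."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: X^T (y - mu)
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information: X^T diag(mu) X (a 2x2 matrix)
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Solve the 2x2 system (information) * step = score
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data, invented for this example.
x = [0, 1, 2, 3, 4, 5]
y = [1, 2, 2, 5, 7, 12]
b0, b1 = fit_poisson(x, y)
mu_hat = [math.exp(b0 + b1 * xi) for xi in x]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum of fitted means = {sum(mu_hat):.3f}, sum of counts = {sum(y)}")
```

At convergence the fitted means add up to the observed total, which is one of the score equations of the canonical-link Poisson model.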
Coefficient interpretation: multiplicative
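For a unit increase in $x_j$, with the other covariates held fixed, $\mu$ multiplies by $e^{\beta_j}$:

$$\frac{\mu(x_j + 1)}{\mu(x_j)} = \frac{\exp(\beta_j (x_j + 1))}{\exp(\beta_j x_j)} = e^{\beta_j}$$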
Example: a coefficient of 0.5 multiplies the expected count by e^0.5 ≈ 1.65 per unit of x_j, i.e. about 65% higher. Coefficients are on the LOG scale; effects on the RATE scale are multiplicative.
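The arithmetic behind that example can be checked in a few lines. The helper name `rate_ratio` is an illustrative choice, not part of any library.

```python
import math

def rate_ratio(beta, delta=1.0):
    """Multiplicative change in the expected count for a `delta`-unit
    increase in the covariate, holding the other covariates fixed."""
    return math.exp(beta * delta)

print(round(rate_ratio(0.5), 3))              # one-unit increase -> 1.649
print(round(rate_ratio(0.5, 2.0), 3))         # two-unit increase -> 2.718
print(round(100 * (rate_ratio(0.5) - 1), 1))  # percent change per unit -> 64.9
```

Note that a two-unit increase squares the one-unit ratio; it does not double it.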
Overdispersion
Poisson assumes Var = Mean. Real count data often has Var >> Mean (overdispersion) due to unobserved heterogeneity, clustering, or temporal correlation. Diagnose via:
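- Pearson dispersion: $\hat\phi = \frac{1}{n - p} \sum_{i=1}^{n} \frac{(y_i - \hat\mu_i)^2}{\hat\mu_i}$, with $n$ observations and $p$ fitted coefficients. Should be ≈ 1 under Poisson; $\hat\phi$ well above 1 indicates overdispersion.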
- Residuals-vs-fitted plot: variance growing faster than mean.
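As a rough illustration of the Pearson dispersion statistic, the sketch below computes it for an intercept-only Poisson fit, where every fitted mean is just the sample mean. The counts are invented for the example and `pearson_dispersion` is a hypothetical helper name.

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by the residual degrees of freedom."""
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Hypothetical counts with many zeros and a few large values.
y = [0, 1, 0, 12, 3, 0, 7, 1, 0, 16]

# Intercept-only Poisson fit: the MLE of the mean is the sample mean.
mu_hat = [sum(y) / len(y)] * len(y)

phi_hat = pearson_dispersion(y, mu_hat, n_params=1)
print(f"phi_hat = {phi_hat:.2f}")  # about 8.33, far above 1
```

With a regression model, `mu_hat` would be the fitted means from the GLM and `n_params` the number of estimated coefficients.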
Negative binomial: the standard fix
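Negative binomial allows Var(Y) = μ + αμ². The dispersion parameter α captures overdispersion: α = 0 reduces to Poisson; larger α means more spread. Many tools report the inverse k = 1/α instead (the size parameter, called theta in R), so large k is the Poisson limit. For a fixed α the coefficients are fitted via the same IRLS machinery, with α itself estimated by maximum likelihood; R: MASS::glm.nb().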
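One way to see where the extra αμ² term comes from is to simulate the negative binomial as a gamma-Poisson mixture: each unit gets its own Poisson rate, drawn from a gamma distribution with mean μ. The sketch below uses only the standard library; the helper names are illustrative, and it assumes the inverse-dispersion convention k = 1/α.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for modest lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def nb_draw(mu, alpha, rng):
    """NB as a gamma-Poisson mixture: draw a unit-specific rate
    lam ~ Gamma(shape=1/alpha, scale=alpha*mu), then Y ~ Poisson(lam)."""
    lam = rng.gammavariate(1.0 / alpha, alpha * mu)
    return poisson_draw(lam, rng)

rng = random.Random(0)   # fixed seed so the run is repeatable
mu, alpha = 4.0, 1.0     # alpha = 1 corresponds to k = 1
y = [nb_draw(mu, alpha, rng) for _ in range(20000)]

mean = sum(y) / len(y)
var = sum((yi - mean) ** 2 for yi in y) / (len(y) - 1)
print(f"sample mean {mean:.2f}, sample variance {var:.2f}")
print(f"theoretical variance mu + alpha*mu^2 = {mu + alpha * mu ** 2:.2f}")
```

The sample variance should land near 20 rather than 4: the gamma-distributed rates add αμ² of between-unit variance on top of the Poisson variance μ.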
Offsets: rate models
To model RATES (events per unit exposure), include log-exposure as an OFFSET: a term in the linear predictor whose coefficient is fixed at 1, so that log μ = log(exposure) + x'β. R: glm(Y ~ X, offset = log(exposure), family = poisson). The coefficients then describe the rate per unit exposure rather than the raw count.
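For intuition, the simplest offset model has only an intercept, and its Poisson MLE has a closed form: the estimated rate is total events divided by total exposure. The sketch below uses invented numbers to show this; it is a special case, not a general fitting routine.

```python
import math

# Hypothetical data: event counts and exposure (e.g. person-years) per region.
events = [3, 10, 4, 25]
exposure = [100.0, 400.0, 150.0, 1000.0]

# Intercept-only model with offset: log(mu_i) = log(exposure_i) + b0.
# Setting the score sum(y_i - exposure_i * exp(b0)) to zero gives
# exp(b0) = total events / total exposure.
b0 = math.log(sum(events) / sum(exposure))
rate = math.exp(b0)

# Fitted counts scale with exposure; the rate is shared across regions.
fitted = [e * rate for e in exposure]
print(f"rate per unit exposure = {rate:.4f}")
print("fitted counts:", [round(f, 2) for f in fitted])
```

Without the offset, the model would treat a region with 1000 units of exposure and one with 100 as directly comparable, which confounds the rate with the amount of exposure.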
Zero-inflated counts
When zero counts are over-represented (e.g., medical visits per year, where many people have 0), use zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB). These models mix a point mass at 0 with the count distribution.
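The mixture can be written down directly. The sketch below defines a ZIP probability mass function and compares its zero probability with a plain Poisson that has the same μ. The values μ = 4, π₀ = 0.4 and n = 200 are example settings, and the function names are illustrative.

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) for Y ~ Poisson(mu)."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def zip_pmf(y, mu, pi0):
    """Zero-inflated Poisson: with probability pi0 the outcome is a structural
    zero, otherwise it is drawn from Poisson(mu)."""
    structural = pi0 if y == 0 else 0.0
    return structural + (1.0 - pi0) * poisson_pmf(y, mu)

mu, pi0, n = 4.0, 0.4, 200
p0_zip = zip_pmf(0, mu, pi0)
p0_pois = poisson_pmf(0, mu)
excess = p0_zip - p0_pois  # equals pi0 * (1 - exp(-mu))

print(f"P(Y=0): ZIP {p0_zip:.3f} vs Poisson {p0_pois:.3f}, excess {excess:.3f}")
print(f"expected zeros in n={n}: ZIP {n * p0_zip:.0f} vs Poisson {n * p0_pois:.0f}")
```

With these settings the ZIP model expects about 82 zeros in 200 observations, while the Poisson with the same μ expects about 4.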
Hands-on dispersion diagnostics
The widget below lets you choose a true count DGP (Poisson, Negative Binomial, or Zero-Inflated Poisson), generate samples, and watch the diagnostic statistics live. The dispersion gauge flashes red when φ̂ > 2 (strong overdispersion), amber for 1.2-2 (mild), green for 0.9-1.2 (≈ Poisson). The histogram overlays both the fitted Poisson and NB PMFs so you can SEE which family fits better.
Try it
- Set DGP = Pure Poisson with μ = 4. Note φ̂ stays near 1, and the Poisson curve (blue) matches the histogram well. The Q-Q plot of Pearson residuals hugs the diagonal — Poisson is correctly specified.
- Switch DGP to Negative Binomial with μ = 4, k = 1. Notice φ̂ jumps to 4-7 (strong overdispersion). The Poisson PMF (blue) under-predicts the tails; the NB PMF (purple) tracks the histogram. Q-Q residuals fan out at the extremes.
- Crank NB k to 30. The two curves overlap and φ̂ ≈ 1 — large k is the Poisson limit. Compare with k = 0.5 (very high dispersion): the histogram becomes long-tailed and the Poisson fit completely fails.
- Switch to Zero-Inflated with μ = 4, π₀ = 0.4. The "Excess zeros" readout becomes substantially positive: π₀(1 − e^−μ) = 0.4 − 0.4·e^−4 ≈ +0.39. The Poisson PMF predicts ≈ 4 zeros for n = 200 (200·e^−4 ≈ 3.7), but you see ~80+. This is the zero-inflation signature; Poisson can't represent it even with the right μ.
- Set DGP = Pure Poisson but make n very small (n = 30). Re-seed several times. Notice φ̂ now fluctuates wildly, sometimes flagging "mild overdispersion" even on correctly-specified Poisson data. Small samples make dispersion diagnostics noisy — always check whether you have ENOUGH data to trust the gauge.
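The small-sample noise in the last step can be reproduced outside the widget. The sketch below simulates correctly specified Poisson data many times and reports the central 95% range of φ̂ for n = 30 and n = 500. The sample sizes, seed, and helper names are arbitrary choices for illustration.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for modest lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def phi_hat(y):
    """Pearson dispersion for an intercept-only Poisson fit (mu_hat = sample mean)."""
    m = sum(y) / len(y)
    return sum((yi - m) ** 2 / m for yi in y) / (len(y) - 1)

def phi_interval(n, rng, reps=500, mu=4.0):
    """Central 95% range of phi_hat over `reps` simulated Poisson samples of size n."""
    vals = sorted(
        phi_hat([poisson_draw(mu, rng) for _ in range(n)]) for _ in range(reps)
    )
    return vals[int(0.025 * reps)], vals[int(0.975 * reps) - 1]

rng = random.Random(42)  # fixed seed so the run is repeatable
lo_small, hi_small = phi_interval(30, rng)
lo_large, hi_large = phi_interval(500, rng)
print(f"n = 30:  phi_hat mostly in [{lo_small:.2f}, {hi_small:.2f}]")
print(f"n = 500: phi_hat mostly in [{lo_large:.2f}, {hi_large:.2f}]")
```

Even though the data are pure Poisson, the n = 30 range extends well past 1.2, which is why a single small sample can trip the "mild overdispersion" band.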
A reservoir engineer fits Poisson regression to "number of failed wells per field" and finds φ̂ = 5.3. The fix is "use negative binomial" — but WHY does NB rescue them? In one sentence, what physical or statistical structure does the NB's extra k parameter capture that pure Poisson cannot? (Hint: think of unmodelled heterogeneity across fields.)
References
- Cameron, A.C., Trivedi, P.K. (2013). Regression Analysis of Count Data, 2nd ed. Cambridge.
- Hilbe, J.M. (2014). Modeling Count Data. Cambridge.
- Lambert, D. (1992). "Zero-inflated Poisson regression." Technometrics 34(1), 1–14.
- McCullagh, P., Nelder, J.A. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall.