Kernel density estimation

Part 8 — Resampling and nonparametrics

Learning objectives

  • Define KDE: place a kernel at each data point; sum and normalise
  • Compute a default bandwidth via Silverman's rule of thumb
  • Recognise the BIAS-VARIANCE trade-off in bandwidth selection
  • Apply alternative bandwidth selectors: cross-validation, plug-in (Sheather-Jones)
  • Apply KDE to multimodal data and compare to histogram-based smoothing

KDE is the canonical nonparametric estimator of a continuous probability density. Given data $x_1, \ldots, x_N \sim F$, the kernel density estimator at point $x$ is

$$\hat{f}(x) = \frac{1}{N h} \sum_{i=1}^N K\left(\frac{x - x_i}{h}\right),$$

where $K$ is a kernel function (e.g., the standard Normal density) and $h$ is the BANDWIDTH (smoothing parameter). Geometrically, KDE places a small "bump" of width h at each data point and sums them. Choice of kernel matters less than choice of bandwidth: most reasonable kernels give similar results.
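
As a minimal sketch of the estimator above, the following Python evaluates the KDE on a grid directly from the definition, using a Gaussian kernel. The function names and the five data values are illustrative assumptions, not part of the course material; for real work a library routine is usually preferable.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Normal density, used here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_eval, data, h):
    """Evaluate f_hat(x) = (1 / (N h)) * sum_i K((x - x_i) / h) at each x_eval."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x_eval[:, None] - data[None, :]) / h   # shape (len(x_eval), N)
    return gaussian_kernel(u).sum(axis=1) / (data.size * h)

# Illustrative data (made up for this sketch)
data = np.array([-1.2, -0.4, 0.1, 0.3, 1.8])
grid = np.linspace(-8.0, 8.0, 2001)
density = kde(grid, data, h=0.5)

# Riemann-sum check that the estimate integrates to roughly 1
area = float(np.sum(density) * (grid[1] - grid[0]))
```

Each column of `u` corresponds to one data point's bump; summing across columns and dividing by N h is the "sum and normalise" step.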

Bandwidth: the central design choice

Bandwidth controls the bias-variance trade-off:

  • Small h (undersmoothing): KDE has many wiggles, following each data point. Bias is small (each point contributes its own bump) but variance is high (the estimate depends sensitively on which observations were drawn).
  • Large h (oversmoothing): KDE is very smooth, possibly blurring real structure like multimodality. Bias is high (true sharp peaks become broad mounds); variance is low (the estimate is stable across resamples).
  • Optimal h: minimises the MEAN INTEGRATED SQUARED ERROR (MISE) $\int E[(\hat{f}(x) - f(x))^2]\,dx$. For a true Normal density and a Gaussian kernel, the AMISE-optimal h is $h^* = 1.06\,\sigma N^{-1/5}$, the Normal-reference value behind SILVERMAN'S RULE OF THUMB.
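
The constant 1.06 can be recovered from the standard asymptotic expansion of the MISE. The sketch below assumes a Gaussian kernel and a Normal target with standard deviation σ; the shorthand $R(g) = \int g(x)^2\,dx$ and $\mu_2(K) = \int u^2 K(u)\,du$ is introduced here for convenience and is not notation used elsewhere on this page.

```latex
\begin{align*}
\text{AMISE}(h) &= \underbrace{\frac{R(K)}{N h}}_{\text{integrated variance}}
                 + \underbrace{\tfrac{1}{4}\, h^4\, \mu_2(K)^2\, R(f'')}_{\text{integrated squared bias}}, \\
h^* &= \left[ \frac{R(K)}{\mu_2(K)^2\, R(f'')\, N} \right]^{1/5}, \\
R(K) &= \frac{1}{2\sqrt{\pi}}, \qquad \mu_2(K) = 1, \qquad
R(f'') = \frac{3}{8\sqrt{\pi}\,\sigma^5}, \\
h^* &= \left( \frac{4}{3} \right)^{1/5} \sigma N^{-1/5} \approx 1.06\, \sigma N^{-1/5}.
\end{align*}
```

The variance term grows as h shrinks and the bias term grows as h increases, which is the trade-off described in the bullets above.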

Silverman's rule of thumb (1986)

The most commonly used default:

$$h = 1.06\, \hat{\sigma} N^{-1/5},$$

where $\hat{\sigma} = \min(\hat{s}, \text{IQR}/1.34)$, with $\hat{s}$ the sample standard deviation, to be robust against outliers. (Silverman's own robust recommendation pairs this spread estimate with the smaller constant 0.9; the 1.06 form is the Normal-reference variant.) The rule assumes the true density is approximately Normal, so for multimodal or heavy-tailed data it tends to OVERSMOOTH. It is a good "starter" bandwidth; refine via cross-validation if the data clearly aren't Normal.
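
A minimal Python sketch of this rule follows. The function name and the `factor` argument are assumptions of this example (the text uses 1.06); the contaminated sample is synthetic and only illustrates why the IQR-based term is included.

```python
import numpy as np

def rule_of_thumb_bandwidth(data, factor=1.06):
    """h = factor * min(s, IQR / 1.34) * N**(-1/5)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    s = data.std(ddof=1)                       # sample standard deviation
    q75, q25 = np.percentile(data, [75, 25])
    sigma_hat = min(s, (q75 - q25) / 1.34)     # robust spread estimate
    return factor * sigma_hat * n ** (-1.0 / 5.0)

rng = np.random.default_rng(0)
clean = rng.normal(size=200)
contaminated = np.concatenate([clean, [50.0]])  # one gross outlier (synthetic)

h_clean = rule_of_thumb_bandwidth(clean)
h_contaminated = rule_of_thumb_bandwidth(contaminated)
```

The single outlier inflates the standard deviation considerably, but the minimum switches to IQR/1.34, so the two bandwidths should stay close.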

Cross-validation bandwidth

Pick h to minimise the cross-validated integrated squared error:

$$\text{CV}(h) = \int \hat{f}^2(x)\,dx - \frac{2}{N} \sum_{i=1}^N \hat{f}_{-i}(x_i),$$

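where $\hat{f}_{-i}$ is the KDE computed without observation i. Minimise over h via grid search. The method assumes no parametric form and adapts to the data shape. In R, density(x, bw = "ucv") selects the bandwidth by unbiased (least-squares) cross-validation. Python's scipy.stats.gaussian_kde has no cross-validation selector: its bw_method accepts "scott", "silverman", a scalar, or a callable, so a cross-validated bandwidth has to be computed separately and supplied as a scalar (which scipy interprets as a multiple of the sample standard deviation).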
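
The grid search can be sketched in a few lines of Python. This version assumes a Gaussian kernel, for which the integral of the squared estimate has a closed form (the convolution of two Normal kernels with standard deviation h is Normal with standard deviation √2·h). The function names, the synthetic bimodal sample, and the grid limits are illustrative assumptions.

```python
import numpy as np

def normal_pdf(u, sd):
    """Density of N(0, sd^2) evaluated at u."""
    return np.exp(-0.5 * (u / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def lscv_score(data, h):
    """CV(h) = int f_hat^2 dx - (2/N) sum_i f_hat_{-i}(x_i), Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    n = data.size
    diffs = data[:, None] - data[None, :]
    # Closed form: int f_hat^2 = (1/N^2) sum_{i,j} phi_{sqrt(2) h}(x_i - x_j)
    integral_f2 = normal_pdf(diffs, np.sqrt(2.0) * h).sum() / n**2
    k = normal_pdf(diffs, h)
    np.fill_diagonal(k, 0.0)            # leave observation i out of its own estimate
    loo = k.sum(axis=1) / (n - 1)       # f_hat_{-i}(x_i)
    return integral_f2 - 2.0 * loo.mean()

def lscv_bandwidth(data, h_grid):
    """Return the grid value minimising CV(h), plus all scores."""
    scores = np.array([lscv_score(data, h) for h in h_grid])
    return float(h_grid[np.argmin(scores)]), scores

# Synthetic bimodal sample (assumed for illustration)
rng = np.random.default_rng(2)
sample = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(2.0, 0.5, 100)])
h_grid = np.linspace(0.05, 2.0, 40)
h_cv, cv_scores = lscv_bandwidth(sample, h_grid)
```

The score is known to be noisy in h, so it is worth plotting `cv_scores` against `h_grid` rather than trusting the minimiser blindly.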

Plug-in bandwidth: Sheather-Jones

The Sheather-Jones (1991) plug-in estimator uses an iterative procedure that plugs an estimate of the unknown functional $\int f''(x)^2\,dx$ into the AMISE-optimal formula. It is generally more accurate than Silverman's rule but computationally heavier, and it is a commonly recommended default; in R it is available as density(x, bw = "SJ").

Adaptive bandwidth

FIXED-bandwidth KDE uses the same h everywhere. ADAPTIVE-bandwidth KDE uses larger h in low-density regions (where data is sparse) and smaller h in high-density regions (where data is dense). Particularly useful for distributions with sharp peaks mixed with broad tails. Computational cost is higher; benefit depends on the distribution.
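
One common way to realise this idea is the two-stage scheme described in Silverman (1986): compute a fixed-bandwidth pilot estimate, then scale each point's bandwidth by the pilot density raised to the power −1/2, normalised by its geometric mean. The Python below is a sketch of that particular variant, not necessarily the one intended here; the function names, `alpha` default, and synthetic data are assumptions.

```python
import numpy as np

def normal_pdf(u, sd):
    """Density of N(0, sd^2) at u; sd may be an array that broadcasts with u."""
    return np.exp(-0.5 * (u / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def adaptive_kde(x_eval, data, h, alpha=0.5):
    """Adaptive Gaussian KDE with per-observation bandwidths h_i = h * lambda_i."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    data = np.asarray(data, dtype=float)
    # Step 1: fixed-bandwidth pilot estimate at each data point
    pilot = normal_pdf(data[:, None] - data[None, :], h).mean(axis=1)
    # Step 2: local factors; the geometric mean g keeps the overall scale at h
    g = np.exp(np.mean(np.log(pilot)))
    local_h = h * (pilot / g) ** (-alpha)   # large where the pilot density is low
    # Step 3: average kernels, each with its own bandwidth
    dens = normal_pdf(x_eval[:, None] - data[None, :], local_h[None, :]).mean(axis=1)
    return dens, local_h

# Synthetic data: a sharp peak plus a broad component (assumed for illustration)
rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0.0, 0.2, 150), rng.normal(3.0, 2.0, 50)])
grid = np.linspace(-20.0, 25.0, 9001)
density, local_h = adaptive_kde(grid, data, h=0.3)
area = float(np.sum(density) * (grid[1] - grid[0]))
```

Points in the sparse tail receive bandwidths above 0.3 and points in the sharp peak receive bandwidths below it, while the estimate still integrates to about 1.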

Boundary problems

For bounded distributions (e.g., densities on [0, ∞) or [0, 1]), KDE near the boundary is biased because the kernel "spreads probability mass" outside the support. Fixes include the reflection method (reflect the data about the boundary so that the spilled mass is folded back into the support), boundary kernels, or transform-then-back-transform (estimate the density of log(x) for x > 0, then transform back via change of variables).
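
Two of these fixes can be sketched briefly in Python for a density on [0, ∞). The function names, the exponential test sample, and the bandwidth values are assumptions of this example; note that in the log-transform version the bandwidth applies on the log scale.

```python
import numpy as np

def kde(x_eval, data, h):
    """Plain Gaussian KDE evaluated at x_eval."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x_eval[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def kde_reflect(x_eval, data, h):
    """Reflection about 0: f(x) = f_hat(x) + f_hat(-x) for x >= 0, and 0 otherwise."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    dens = kde(x_eval, data, h) + kde(-x_eval, data, h)
    return np.where(x_eval >= 0.0, dens, 0.0)

def kde_log_transform(x_eval, data, h):
    """Estimate the density g of y = log(x), then back-transform: f(x) = g(log x) / x."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    out = np.zeros_like(x_eval)
    pos = x_eval > 0.0
    out[pos] = kde(np.log(x_eval[pos]), np.log(data), h) / x_eval[pos]
    return out

def trapezoid(x, y):
    """Trapezoidal rule on a possibly non-uniform grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

rng = np.random.default_rng(4)
data = rng.exponential(1.0, 300)          # synthetic sample supported on (0, inf)
grid = np.linspace(0.0, 25.0, 5001)
plain = kde(grid, data, 0.3)
reflected = kde_reflect(grid, data, 0.3)
mass_plain = trapezoid(grid, plain)       # mass the plain KDE keeps on [0, inf)
mass_reflected = trapezoid(grid, reflected)
```

The plain estimate loses part of its mass to the negative half-line, whereas the reflected estimate recovers it and doubles the value at the boundary itself.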

Multivariate KDE

For d-dimensional data, KDE generalises by replacing h with a bandwidth matrix $H$ (full d × d) or, more simply, by using a product kernel with one bandwidth per coordinate. The curse of dimensionality bites hard: the optimal MISE shrinks only at rate N^(−4/(d+4)), so with d >= 5, KDE is mostly impractical. For high-dimensional density estimation, use parametric mixtures (Gaussian mixture models) or modern alternatives (normalizing flows, autoregressive density estimators).
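
To make both points concrete, here is a small Python sketch of a product-kernel estimator with a diagonal bandwidth, together with a helper that turns the N^(−4/(d+4)) rate into a sample-size multiplier. Both function names are assumptions of this example, and the multiplier takes the rate at face value, ignoring constants.

```python
import numpy as np

def product_kde(x_eval, data, h):
    """Product Gaussian kernel KDE.

    x_eval: array of shape (M, d); data: array of shape (N, d);
    h: one bandwidth per coordinate (length d) or a single scalar.
    """
    x_eval = np.atleast_2d(np.asarray(x_eval, dtype=float))
    data = np.atleast_2d(np.asarray(data, dtype=float))
    h = np.broadcast_to(np.asarray(h, dtype=float), (data.shape[1],))
    u = (x_eval[:, None, :] - data[None, :, :]) / h          # shape (M, N, d)
    log_norm = np.sum(np.log(h * np.sqrt(2.0 * np.pi)))      # product of 1-D normalisers
    return np.exp(-0.5 * np.sum(u**2, axis=2) - log_norm).mean(axis=1)

def sample_size_multiplier(d, error_factor=0.5):
    """Factor by which N must grow to scale the MISE by error_factor,
    assuming MISE is proportional to N**(-4 / (d + 4))."""
    return error_factor ** (-(d + 4) / 4.0)

multipliers = {d: sample_size_multiplier(d) for d in (1, 2, 5, 10)}
```

Under this rate, halving the error needs a little over twice the data in one dimension but more than ten times the data in ten dimensions.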

Why KDE matters

KDE is the workhorse for:

  • Exploratory data analysis (visualise the data's shape without bin-edge artifacts).
  • Computing density-based statistics (the mode, or all modes for multimodal data).
  • Inputs to other procedures (e.g., density-ratio estimation, importance sampling).
  • Diagnostic plots in regression and time-series.
  • Non-parametric goodness-of-fit testing.

[Interactive figure: KDE Explorer]

Try it

  • Defaults: N = 80, Silverman bandwidth. The KDE (red) is reasonable but oversmooths the dip between the two mixture peaks, as expected from a Normal-reference bandwidth on bimodal data. The histogram (gray) gives a coarse, bin-edge-sensitive view; the rug (blue) shows individual data positions.
  • Click "Undersmooth (h = 0.15)". KDE becomes wiggly, following each individual data point — the high-variance regime. Bumps appear at each datum; bimodal structure is exaggerated.
  • Click "Oversmooth (h = 1.5)". KDE smooths over both modes into a single broad mound — high bias, multimodal structure is gone. This is what happens when bandwidth is chosen too large.
  • Bump N to 300. With more data, Silverman's default bandwidth gets smaller (proportional to N^(-1/5)) and KDE follows the true shape more closely. The MISE at the optimal bandwidth scales roughly as N^(-4/5).
  • Switch to Manual and explore: try h = 0.4 with N = 200. Good resolution of both peaks. Lower N + the same h = noisier estimate. The Silverman bandwidth itself depends on N — see how it shifts in the readout.
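
The same experiments can be approximated offline. The Python sketch below draws N = 80 points from an assumed two-component Normal mixture (the widget's actual mixture parameters are not stated on this page) and counts the local maxima of the KDE at the three bandwidths mentioned above.

```python
import numpy as np

def kde(x_eval, data, h):
    """Plain Gaussian KDE evaluated at x_eval."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x_eval[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def count_modes(density):
    """Number of strict interior local maxima on the evaluation grid."""
    d = np.asarray(density)
    return int(np.sum((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])))

rng = np.random.default_rng(1)
n = 80
# Assumed mixture: equal weights, means -1.2 and 1.2, common sd 0.5
from_left = rng.random(n) < 0.5
data = np.where(from_left, rng.normal(-1.2, 0.5, n), rng.normal(1.2, 0.5, n))

grid = np.linspace(-5.0, 5.0, 1001)
modes = {h: count_modes(kde(grid, data, h)) for h in (0.15, 0.4, 1.5)}
```

For a Gaussian kernel the number of modes cannot increase as h grows, so the counts in `modes` should fall (or stay level) from h = 0.15 to h = 1.5, with the largest bandwidth merging the two peaks into one.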

A data analyst wants to plot the distribution of household incomes (right-skewed, bounded at 0). Why might a histogram be misleading and how should they configure KDE for this case?

What you now know

KDE estimates a continuous density by placing kernels at each data point. Bandwidth controls bias-variance trade-off; Silverman's rule gives a Normal-target default, Sheather-Jones a better plug-in alternative. Boundary problems require reflection or transformation. Curse of dimensionality limits KDE to d <= 4 in practice. §8.6 closes Part 8 with QUANTILE REGRESSION — a distribution-free alternative for regression that estimates quantiles directly.

References

  • Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall. (The standard reference.)
  • Wand, M.P., Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall.
  • Sheather, S.J., Jones, M.C. (1991). "A reliable data-based bandwidth selection method for kernel density estimation." JRSS-B 53(3), 683–690.
  • Scott, D.W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization (2nd ed.). Wiley.
  • Hyndman, R.J., Bashtannyk, D.M., Grunwald, G.K. (1996). "Estimating and visualizing conditional densities." J. Comp. & Graph. Stat. 5(4), 315–336.
