Kernel density estimation
Learning objectives
- Define KDE: place a kernel at each data point; sum and normalise
- Compute a default bandwidth via Silverman's rule of thumb
- Recognise the BIAS-VARIANCE trade-off in bandwidth selection
- Apply alternative bandwidth selectors: cross-validation, plug-in (Sheather-Jones)
- Apply KDE to multimodal data and compare to histogram-based smoothing
KDE is the canonical nonparametric estimator of a continuous probability density. Given data $x_1, \dots, x_N$, the kernel density estimator at point $x$ is
$$\hat f_h(x) = \frac{1}{N h} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right),$$
where $K$ is a kernel function (e.g., the standard Normal density) and $h > 0$ is the BANDWIDTH (smoothing parameter). Geometrically, KDE places a small "bump" of width h at each data point, sums them, and normalises so the total area is 1. Choice of kernel matters less than choice of bandwidth: most reasonable kernels give similar results.
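As a minimal sketch of the definition (one Gaussian bump per observation, summed and divided by N h), the estimator can be written in a few lines of standard-library Python. The function names and the toy sample are illustrative assumptions, not part of the original text.

```python
import math

def gaussian_kernel(u):
    """Standard Normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Evaluate the fixed-bandwidth KDE at a single point x."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Hypothetical toy sample: two loose clusters.
sample = [-2.1, -1.8, -1.5, 0.9, 1.2, 1.4, 1.7]
density_at_zero = kde(0.0, sample, h=0.5)

# Riemann-sum check that the estimate integrates to approximately 1.
step = 0.01
grid = [-8.0 + step * k for k in range(1601)]
total_mass = sum(kde(g, sample, h=0.5) for g in grid) * step
```

Each evaluation costs O(N), so a vectorised or binned implementation would be preferable for large samples; this version only aims to mirror the formula.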
Bandwidth: the central design choice
Bandwidth controls the bias-variance trade-off:
- Small h (undersmoothing): KDE has many wiggles, following each data point. Bias is small (each point contributes its own bump) but variance is high (the estimate depends sensitively on which observations were drawn).
- Large h (oversmoothing): KDE is very smooth, possibly blurring real structure like multimodality. Bias is high (true sharp peaks become broad mounds); variance is low (the estimate is stable across resamples).
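- Optimal h: minimises the MEAN INTEGRATED SQUARED ERROR, $\mathrm{MISE}(h) = E\int \big(\hat f_h(x) - f(x)\big)^2\,dx$. For a true Normal density with standard deviation $\sigma$ and a Gaussian kernel, the AMISE-optimal h is $h^* = (4/(3N))^{1/5}\,\sigma \approx 1.06\,\sigma N^{-1/5}$, which is the basis of SILVERMAN'S RULE OF THUMB.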
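The trade-off in the bullets above can be made quantitative with the standard asymptotic expansion of the MISE. This is a textbook result, stated here for a second-order kernel and a twice-differentiable density. The shorthand $R(g) = \int g(u)^2\,du$ and $\mu_2(K) = \int u^2 K(u)\,du$ is introduced only for this sketch.

$$\mathrm{AMISE}(h) = \underbrace{\frac{R(K)}{N h}}_{\text{variance}} + \underbrace{\frac{h^4}{4}\,\mu_2(K)^2\,R(f'')}_{\text{squared bias}}$$

The variance term falls as h grows and the bias term rises. Setting the derivative with respect to h to zero gives

$$h^* = \left[\frac{R(K)}{\mu_2(K)^2\,R(f'')\,N}\right]^{1/5}, \qquad \mathrm{AMISE}(h^*) \propto N^{-4/5}.$$

For a Gaussian kernel, $R(K) = 1/(2\sqrt{\pi})$ and $\mu_2(K) = 1$. For a Normal density with standard deviation $\sigma$, $R(f'') = 3/(8\sqrt{\pi}\,\sigma^5)$. Substituting,

$$h^* = \left(\frac{4}{3N}\right)^{1/5}\sigma \approx 1.06\,\sigma N^{-1/5}.$$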
Silverman's rule of thumb (1986)
The most commonly used default:
$$h = 0.9\,A\,N^{-1/5},$$
where $A = \min(\hat\sigma,\ \mathrm{IQR}/1.34)$ to be robust against outliers. The rule assumes the true density is approximately Normal, so for multimodal or heavy-tailed data Silverman tends to OVERSMOOTH. It is a good "starter" bandwidth; refine via cross-validation if the data clearly aren't Normal.
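A minimal sketch of the rule in its common form, h = 0.9 · min(σ̂, IQR/1.34) · N^(-1/5), using only the standard library. The function name is an assumption for illustration, and `statistics.quantiles` uses its default quartile convention, which can differ slightly from other software.

```python
import random
import statistics

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * N^(-1/5)."""
    n = len(data)
    sigma = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    # Fall back to the standard deviation if the IQR collapses to zero.
    spread = min(sigma, iqr / 1.34) if iqr > 0 else sigma
    return 0.9 * spread * n ** (-1 / 5)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
h = silverman_bandwidth(data)

# One gross outlier inflates the standard deviation, but the IQR term
# keeps the bandwidth close to its original value.
h_with_outlier = silverman_bandwidth(data + [1000.0])
```

For roughly standard Normal data with N = 1000 the result should be close to 0.9 × 1000^(-1/5), about 0.23.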
Cross-validation bandwidth
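Pick h to minimise the cross-validated integrated squared error:
$$\mathrm{CV}(h) = \int \hat f_h(x)^2\,dx - \frac{2}{N}\sum_{i=1}^{N} \hat f_{h,-i}(x_i),$$
where $\hat f_{h,-i}$ is the KDE without observation i. Minimise over h via grid search. The criterion doesn't assume any parametric form and adapts to the data shape. R's density(x, bw = "ucv") implements this selector. Python's scipy.stats.gaussian_kde does not: its bw_method accepts only the "scott" and "silverman" rules of thumb, a scalar factor, or a callable, so cross-validation has to be coded separately.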
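A hedged sketch of least-squares cross-validation by grid search, assuming a Gaussian kernel. With that kernel the integral of the squared estimate has a closed form (a double sum of Normal densities with scale h√2), so no numerical integration is needed. The bimodal sample, the grid, and the function names are assumptions chosen for illustration.

```python
import math
import random

def normal_pdf(u, s):
    """Density of N(0, s^2) evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def lscv_score(data, h):
    """CV(h) = int f_hat^2 dx - (2/N) * sum_i f_hat_{-i}(x_i), Gaussian kernel."""
    n = len(data)
    sq_term = 0.0   # sum over all (i, j) of phi_{h*sqrt(2)}(x_i - x_j)
    loo_term = 0.0  # sum over i != j of phi_h(x_i - x_j)
    for i in range(n):
        for j in range(n):
            d = data[i] - data[j]
            sq_term += normal_pdf(d, h * math.sqrt(2.0))
            if i != j:
                loo_term += normal_pdf(d, h)
    integral_f_sq = sq_term / n ** 2
    mean_leave_one_out = loo_term / (n * (n - 1))
    return integral_f_sq - 2.0 * mean_leave_one_out

random.seed(0)
# Hypothetical bimodal sample: equal mixture of N(-1.2, 0.5^2) and N(1.2, 0.5^2).
data = [random.gauss(-1.2, 0.5) if random.random() < 0.5 else random.gauss(1.2, 0.5)
        for _ in range(60)]

grid = [0.05 * k for k in range(1, 41)]  # candidate bandwidths 0.05 ... 2.0
best_h = min(grid, key=lambda h: lscv_score(data, h))
```

The double loop is O(N²) per candidate bandwidth, which is fine for small samples; the criterion can also have several local minima, so inspecting the whole score curve is safer than trusting a single minimiser.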
Plug-in bandwidth: Sheather-Jones
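The Sheather-Jones (1991) plug-in estimator uses an iterative procedure that plugs estimates of the unknown functionals into the AMISE-optimal formula. It is generally more accurate than Silverman's rule, especially when the data are not Normal, but computationally heavier. It is often recommended as a general-purpose default; in R it is available as density(x, bw = "SJ").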
Adaptive bandwidth
FIXED-bandwidth KDE uses the same h everywhere. ADAPTIVE-bandwidth KDE uses larger h in low-density regions (where data is sparse) and smaller h in high-density regions (where data is dense). Particularly useful for distributions with sharp peaks mixed with broad tails. Computational cost is higher; benefit depends on the distribution.
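One common way to realise this idea is the sample-point adaptive estimator with a square-root law (Abramson's rule, as presented in Silverman 1986): a fixed-bandwidth pilot estimate is computed first, then each observation gets its own bandwidth h_i = h · (pilot(x_i)/g)^(-1/2), where g is the geometric mean of the pilot values. The text does not say which adaptive scheme it has in mind, so this choice, the toy data, and the function names are assumptions.

```python
import math

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def fixed_kde(x, data, h):
    """Pilot estimate: ordinary fixed-bandwidth Gaussian KDE."""
    return sum(normal_pdf((x - xi) / h) for xi in data) / (len(data) * h)

def adaptive_bandwidths(data, h, alpha=0.5):
    """Per-observation bandwidths h_i = h * (pilot(x_i) / g) ** (-alpha)."""
    pilot = [fixed_kde(xi, data, h) for xi in data]
    g = math.exp(sum(math.log(p) for p in pilot) / len(pilot))  # geometric mean
    return [h * (p / g) ** (-alpha) for p in pilot]

def adaptive_kde(x, data, local_h):
    """Each kernel uses its own bandwidth; the average still integrates to 1."""
    n = len(data)
    return sum(normal_pdf((x - xi) / hi) / hi for xi, hi in zip(data, local_h)) / n

# Hypothetical sample: a tight cluster plus two isolated tail points.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 3.0, 6.0]
local_h = adaptive_bandwidths(data, h=0.5)

step = 0.01
grid = [-10.0 + step * k for k in range(3001)]  # covers [-10, 20]
total_mass = sum(adaptive_kde(x, data, local_h) for x in grid) * step
```

The isolated points at 3.0 and 6.0 receive wider kernels than the clustered points, which is the behaviour described above.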
Boundary problems
For bounded distributions (e.g., densities on [0, ∞) or [0, 1]), KDE near the boundary is biased: the kernel "spreads probability mass" outside the support. Fixes: the reflection method (reflect the data about the boundary so that the leaked mass is folded back into the support), boundary kernels, or transform-then-back-transform (estimate the density of log(x) for x > 0, then transform back via change of variables).
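A minimal sketch of the reflection method for support [0, ∞), assuming a Gaussian kernel. Because the kernel is symmetric, adding the estimate evaluated at -x is the same as adding kernels centred on the reflected data. The exponential sample, bandwidth, and function names are illustrative assumptions.

```python
import math
import random

def kde(x, data, h):
    """Ordinary Gaussian KDE, unaware of any boundary."""
    c = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

def reflected_kde(x, data, h):
    """KDE on [0, inf): mass that leaked below 0 is folded back by reflection."""
    if x < 0.0:
        return 0.0
    return kde(x, data, h) + kde(-x, data, h)

random.seed(0)
data = [random.expovariate(1.0) for _ in range(200)]  # true density is largest at 0
h = 0.3

# Midpoint-rule integration over [0, 20].
step = 0.01
midpoints = [(k + 0.5) * step for k in range(2000)]
plain_mass = sum(kde(x, data, h) for x in midpoints) * step
reflected_mass = sum(reflected_kde(x, data, h) for x in midpoints) * step
```

The plain estimate places a noticeable share of its mass below zero, so its mass on the support falls short of 1, while the reflected estimate integrates to 1 on [0, ∞). Reflection forces a zero slope at the boundary, which may not match the true density there.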
Multivariate KDE
For d-dimensional data, KDE generalises by replacing h with a bandwidth matrix (full d × d) or by using a product kernel with one bandwidth per coordinate. The curse of dimensionality bites hard: the best achievable MISE decays only at rate N^(−4/(d+4)), so with d >= 5 KDE is mostly impractical. For high-dimensional density estimation, use parametric mixtures (Gaussian mixture models) or modern alternatives (normalizing flows, autoregressive density estimators).
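To get a feel for the N^(−4/(d+4)) rate, the short calculation below asks how many observations are needed in d dimensions to match the rate term obtained with 100 observations in one dimension. It compares rates only and ignores the dimension-dependent constants, so the numbers are rough orders of magnitude; the function name is an assumption.

```python
def equivalent_sample_size(n_ref, d, d_ref=1):
    """Solve N^(-4/(d+4)) = n_ref^(-4/(d_ref+4)) for N."""
    return n_ref ** ((d + 4) / (d_ref + 4))

# Sample sizes matching N = 100 in one dimension, by dimension d.
needed = {d: round(equivalent_sample_size(100, d)) for d in (1, 2, 5, 10)}
```

Under this comparison, 100 points in one dimension correspond to about 250 in two dimensions, about 4,000 in five, and about 400,000 in ten.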
Why KDE matters
KDE is the workhorse for:
- Exploratory data analysis (visualise the data's shape without bin-edge artifacts).
- Computing pdf-based statistics (the mode, or several modes for multimodal data).
- Inputs to other procedures (e.g., density-ratio estimation, importance sampling).
- Diagnostic plots in regression and time-series.
- Non-parametric goodness-of-fit testing.
Try it
- Defaults: N = 80, Silverman bandwidth. The KDE (red) is reasonable but oversmooths the valley between the two mixture peaks, as expected for a Normal-reference bandwidth on bimodal data. The histogram (gray) gives a coarse, bin-edge-sensitive view; the rug (blue) shows individual data positions.
- Click "Undersmooth (h = 0.15)". KDE becomes wiggly, following each individual data point — the high-variance regime. Bumps appear at each datum; bimodal structure is exaggerated.
- Click "Oversmooth (h = 1.5)". KDE smooths over both modes into a single broad mound — high bias, multimodal structure is gone. This is what happens when bandwidth is chosen too large.
- Bump N to 300. With more data, Silverman's default bandwidth gets smaller (proportional to N^(-1/5)) and KDE follows the true shape more closely. The MISE shrinks at a rate of roughly N^(-4/5).
- Switch to Manual and explore: try h = 0.4 with N = 200. Good resolution of both peaks. Lower N + the same h = noisier estimate. The Silverman bandwidth itself depends on N — see how it shifts in the readout.
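The experiment above can be approximated offline with a short script. The widget's generating distribution is not stated, so the equal mixture of N(-1.2, 0.5²) and N(1.2, 0.5²) below is an assumption, as are the grid and the mode-counting helper. The script counts local maxima of the KDE at the three bandwidths used in the steps.

```python
import math
import random

def kde_on_grid(data, h, grid):
    """Gaussian KDE evaluated at every grid point."""
    c = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return [c * sum(math.exp(-0.5 * ((g - xi) / h) ** 2) for xi in data)
            for g in grid]

def count_modes(values):
    """Number of strict local maxima in a sequence of density values."""
    return sum(1 for k in range(1, len(values) - 1)
               if values[k - 1] < values[k] > values[k + 1])

random.seed(1)
# Assumed stand-in for the widget's data: N = 80 from a two-component mixture.
data = [random.gauss(-1.2, 0.5) if random.random() < 0.5 else random.gauss(1.2, 0.5)
        for _ in range(80)]

grid = [-5.0 + 0.05 * k for k in range(201)]  # covers [-5, 5]
modes = {h: count_modes(kde_on_grid(data, h, grid)) for h in (0.15, 0.4, 1.5)}
```

With h = 1.5 the two peaks merge into a single mound, while h = 0.15 typically yields several spurious bumps; the count at h = 0.4 depends on the particular sample.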
A data analyst wants to plot the distribution of household incomes (right-skewed, bounded at 0). Why might a histogram be misleading and how should they configure KDE for this case?
What you now know
KDE estimates a continuous density by placing kernels at each data point. Bandwidth controls bias-variance trade-off; Silverman's rule gives a Normal-target default, Sheather-Jones a better plug-in alternative. Boundary problems require reflection or transformation. Curse of dimensionality limits KDE to d <= 4 in practice. §8.6 closes Part 8 with QUANTILE REGRESSION — a distribution-free alternative for regression that estimates quantiles directly.
References
- Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall. (The standard reference.)
- Wand, M.P., Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall.
- Sheather, S.J., Jones, M.C. (1991). "A reliable data-based bandwidth selection method for kernel density estimation." JRSS-B 53(3), 683–690.
- Scott, D.W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization (2nd ed.). Wiley.
- Hyndman, R.J., Bashtannyk, D.M., Grunwald, G.K. (1996). "Estimating and visualizing conditional densities." J. Comp. & Graph. Stat. 5(4), 315–336.