Kernel density estimation

Part 8 — Resampling and nonparametrics

Learning objectives

Define KDE: place a kernel at each data point; sum and normalise
Compute the optimal bandwidth via Silverman's rule of thumb
Recognise the BIAS-VARIANCE trade-off in bandwidth selection
Apply alternative bandwidth selectors: cross-validation, plug-in (Sheather-Jones)
Apply KDE to multimodal data and compare to histogram-based smoothing

KDE is the canonical nonparametric estimator of a continuous probability density. Given i.i.d. data $x_1, \ldots, x_N$ drawn from a distribution $F$ with density $f$, the kernel density estimator at a point $x$ is

\hat{f}(x) = \frac{1}{N h} \sum_{i=1}^N K\left(\frac{x - x_i}{h}\right),

where $K$ is a kernel function (a symmetric density that integrates to 1, e.g., the standard Normal density) and $h > 0$ is the BANDWIDTH (smoothing parameter). Geometrically, KDE places a small "bump" of width proportional to $h$ at each data point and averages them; the factor $1/(Nh)$ makes the result integrate to 1. The choice of kernel matters less than the choice of bandwidth: most reasonable kernels give similar results.
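As a minimal sketch of the estimator above, assuming a Gaussian kernel and NumPy (the function names `gaussian_kernel` and `kde`, the simulated sample, and the bandwidth 0.4 are illustrative choices, not part of the lesson):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Normal density, used here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_eval, data, h):
    """Evaluate f_hat(x) = (1 / (N h)) * sum_i K((x - x_i) / h) at each x_eval."""
    x_eval = np.asarray(x_eval, dtype=float)
    data = np.asarray(data, dtype=float)
    u = (x_eval[:, None] - data[None, :]) / h   # shape (len(x_eval), N)
    return gaussian_kernel(u).sum(axis=1) / (data.size * h)

rng = np.random.default_rng(0)
sample = rng.normal(size=80)            # illustrative data, not the lesson's
grid = np.linspace(-6.0, 6.0, 1201)
density = kde(grid, sample, h=0.4)

# Riemann-sum check that the estimate integrates to (approximately) 1.
area = float(density.sum() * (grid[1] - grid[0]))
```

The pairwise-difference matrix costs O(N × grid size) memory, which is fine for small samples; library implementations typically use binning or FFTs instead.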

Bandwidth: the central design choice

Bandwidth controls the bias-variance trade-off:

Small h (undersmoothing): KDE has many wiggles, following each data point. Bias is small (each point contributes its own bump) but variance is high (the estimate depends sensitively on which observations were drawn).
Large h (oversmoothing): KDE is very smooth, possibly blurring real structure like multimodality. Bias is high (true sharp peaks become broad mounds); variance is low (the estimate is stable across resamples).
Optimal h: minimises the MEAN INTEGRATED SQUARED ERROR (MISE) $\int E[(\hat{f}(x) - f(x))^2] \, dx$. For a true Normal density and a Gaussian kernel, the minimiser of the asymptotic MISE (AMISE) is $h^* = 1.06 \sigma N^{-1/5}$, the normal-reference bandwidth behind SILVERMAN'S RULE OF THUMB.
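The constant 1.06 can be traced with a short derivation, sketched here under the usual AMISE assumptions (a symmetric second-order kernel and a twice-differentiable $f$). The shorthand $R(g) = \int g(x)^2 dx$ and $\mu_2(K) = \int u^2 K(u) du$ is standard in the literature but is not used elsewhere in this section. The first term below is the integrated variance, the second the integrated squared bias:

\text{AMISE}(h) = \frac{R(K)}{N h} + \frac{h^4}{4} \mu_2(K)^2 R(f'').

Setting the derivative with respect to $h$ to zero gives

h^* = \left[ \frac{R(K)}{\mu_2(K)^2 R(f'') N} \right]^{1/5}.

For a Gaussian kernel, $R(K) = 1/(2\sqrt{\pi})$ and $\mu_2(K) = 1$; for a Normal density with standard deviation $\sigma$, $R(f'') = 3/(8\sqrt{\pi}\sigma^5)$. Substituting,

h^* = \left( \frac{4}{3} \right)^{1/5} \sigma N^{-1/5} \approx 1.06 \sigma N^{-1/5}.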

Silverman's rule of thumb (1986)

The most commonly used default:

h = 1.06 \hat{\sigma} N^{-1/5},

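where $\hat{\sigma} = \min(\hat{s}, \text{IQR}/1.34)$, with $\hat{s}$ the sample standard deviation and IQR the interquartile range, is a scale estimate that is robust against outliers (1.34 is approximately the IQR of a standard Normal). Silverman's own robust variant, the default in R's density(), uses the constant 0.9 in place of 1.06. The rule assumes the true density is approximately Normal, so for multimodal or heavy-tailed data it tends to OVERSMOOTH. It is a good "starter" bandwidth; refine via cross-validation if the data clearly aren't Normal.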
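A small sketch of the rule, assuming NumPy. The function name `normal_reference_bandwidth` and its `factor` argument are hypothetical; `factor=1.06` follows the formula in this section, and `factor=0.9` would give the variant implemented by R's `bw.nrd0`. The two gross outliers are made up to show why the IQR term is there:

```python
import numpy as np

def normal_reference_bandwidth(data, factor=1.06):
    """h = factor * min(s, IQR / 1.34) * N^(-1/5)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    s = data.std(ddof=1)                       # sample standard deviation
    q75, q25 = np.percentile(data, [75, 25])
    sigma_hat = min(s, (q75 - q25) / 1.34)     # robust scale estimate
    return factor * sigma_hat * n ** (-1 / 5)

rng = np.random.default_rng(1)
clean = rng.normal(size=200)
contaminated = np.append(clean, [50.0, -60.0])   # two gross outliers (made up)

h_clean = normal_reference_bandwidth(clean)
h_contaminated = normal_reference_bandwidth(contaminated)

# Same rule without the IQR safeguard: the outliers inflate s and hence h.
h_sd_only = 1.06 * contaminated.std(ddof=1) * contaminated.size ** (-1 / 5)
```

With the safeguard, the bandwidth for the contaminated sample stays close to the clean one; using the standard deviation alone inflates it several-fold.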

Cross-validation bandwidth

Pick h to minimise the cross-validated integrated squared error:

\text{CV}(h) = \int \hat{f}^2(x) dx - \frac{2}{N} \sum_{i=1}^N \hat{f}_{-i}(x_i),

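where $\hat{f}_{-i}$ is the KDE computed without observation $i$. Minimise over $h$ by grid search. The method assumes no parametric form and adapts to the shape of the data. R's density(x, bw = "ucv") implements this selector. Python's scipy.stats.gaussian_kde does not: its bw_method accepts only "scott", "silverman", a scalar, or a callable, so cross-validation has to be coded by hand or taken from another library (for example, statsmodels' KDEMultivariate with bw = "cv_ls").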
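The criterion above can be sketched in NumPy for a Gaussian kernel, where $\int \hat{f}^2(x) dx$ has a closed form because the convolution of two N(0, h²) kernels is an N(0, 2h²) density. The function names, the simulated bimodal sample, and the search grid are all assumptions made for illustration:

```python
import numpy as np

def normal_pdf(z, sd):
    """Density of N(0, sd^2) evaluated at z."""
    return np.exp(-0.5 * (z / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def lscv_score(data, h):
    """CV(h) = int f_hat^2 dx - (2/N) sum_i f_hat_{-i}(x_i), Gaussian kernel."""
    n = data.size
    diffs = data[:, None] - data[None, :]
    # Closed form of int f_hat^2 dx: average of N(0, 2 h^2) densities over all pairs.
    integral_sq = normal_pdf(diffs, np.sqrt(2.0) * h).sum() / n**2
    # Leave-one-out estimates: drop the i = j terms (the diagonal).
    k = normal_pdf(diffs, h)
    loo = (k.sum(axis=1) - np.diag(k)) / (n - 1)
    return integral_sq - 2.0 * loo.mean()

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(2.0, 0.5, 150)])

h_grid = np.linspace(0.05, 1.5, 59)              # hypothetical search grid
scores = np.array([lscv_score(data, h) for h in h_grid])
h_cv = float(h_grid[np.argmin(scores)])

# Normal-reference bandwidth on the same data, for comparison.
q75, q25 = np.percentile(data, [75, 25])
sigma_hat = min(data.std(ddof=1), (q75 - q25) / 1.34)
h_ref = float(1.06 * sigma_hat * data.size ** (-1 / 5))
```

On well-separated bimodal data like this, the cross-validated bandwidth comes out smaller than the normal-reference one, because the latter is driven by the overall spread rather than by the width of each mode.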

Plug-in bandwidth: Sheather-Jones

The Sheather-Jones (1991) plug-in selector iteratively estimates the unknown functional $\int f''(x)^2 dx$ and plugs the estimate into the AMISE-optimal bandwidth formula. It is usually more accurate than Silverman's rule, especially for non-Normal data, at a higher computational cost. It is widely recommended as a general-purpose default and is available in R as density(x, bw = "SJ").

Adaptive bandwidth

FIXED-bandwidth KDE uses the same h everywhere. ADAPTIVE-bandwidth KDE uses larger h in low-density regions (where data is sparse) and smaller h in high-density regions (where data is dense). Particularly useful for distributions with sharp peaks mixed with broad tails. Computational cost is higher; benefit depends on the distribution.
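The section does not say which adaptive scheme it has in mind. One common construction, sketched below as an assumption, is the adaptive kernel estimator with Abramson's square-root law: run a fixed-bandwidth pilot estimate, then scale each point's bandwidth by $(\tilde{f}(x_i)/g)^{-1/2}$, where $g$ is the geometric mean of the pilot values. The function names, the simulated sample, and h = 0.3 are illustrative:

```python
import numpy as np

def normal_pdf(z, sd):
    """Density of N(0, sd^2) at z; sd may be a scalar or a broadcastable array."""
    return np.exp(-0.5 * (z / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def fixed_kde(x_eval, data, h):
    """Ordinary fixed-bandwidth Gaussian KDE."""
    return normal_pdf(x_eval[:, None] - data[None, :], h).mean(axis=1)

def adaptive_kde(x_eval, data, h, alpha=0.5):
    """Adaptive KDE: point i gets bandwidth h * (pilot_i / g)^(-alpha)."""
    pilot = fixed_kde(data, data, h)             # pilot density at the data points
    g = np.exp(np.mean(np.log(pilot)))           # geometric mean of pilot values
    local_h = h * (pilot / g) ** (-alpha)        # wide where sparse, narrow where dense
    diffs = x_eval[:, None] - data[None, :]
    density = normal_pdf(diffs, local_h[None, :]).mean(axis=1)
    return density, local_h

# Sharp central peak plus a broad component (made-up example data).
rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(0.0, 0.3, 300), rng.normal(0.0, 3.0, 100)])
grid = np.linspace(-30.0, 30.0, 6001)
density, local_h = adaptive_kde(grid, data, h=0.3)
area = float(density.sum() * (grid[1] - grid[0]))   # should be close to 1
```

Each kernel still integrates to 1, so the estimate remains a proper density; only the amount of smoothing varies from point to point.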

Boundary problems

For densities with bounded support (e.g., on [0, ∞) or [0, 1]), KDE is biased near the boundary: the kernel "spreads probability mass" outside the support, so the estimate is too low just inside it. Fixes: the reflection method (reflect the data about the boundary, which folds the leaked mass back into the support), boundary kernels, or transform-then-back-transform (estimate the density of $Y = \log X$ for x > 0, then transform back via change of variables, $\hat{f}_X(x) = \hat{f}_Y(\log x) / x$).
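Two of these fixes can be sketched for a density on [0, ∞), assuming a Gaussian kernel and NumPy. The function names, the exponential test sample, and both bandwidths are hypothetical choices for illustration:

```python
import numpy as np

def normal_pdf(z, sd):
    """Density of N(0, sd^2) evaluated at z."""
    return np.exp(-0.5 * (z / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def kde(x_eval, data, h):
    """Ordinary Gaussian KDE, ignoring any boundary."""
    return normal_pdf(x_eval[:, None] - data[None, :], h).mean(axis=1)

def reflected_kde(x_eval, data, h):
    """Reflection about 0: add the mirror image of the mass that leaked below 0."""
    dens = kde(x_eval, data, h) + kde(-x_eval, data, h)
    return np.where(x_eval >= 0.0, dens, 0.0)

def log_transform_kde(x_eval, data, h_log):
    """Estimate the density of Y = log X, then map back: f_X(x) = f_Y(log x) / x."""
    return kde(np.log(x_eval), np.log(data), h_log) / x_eval

def trapezoid_area(y, step):
    """Trapezoid rule on an evenly spaced grid."""
    return float(step * (y.sum() - 0.5 * (y[0] + y[-1])))

rng = np.random.default_rng(3)
data = rng.exponential(scale=1.0, size=500)    # true density exp(-x) on [0, inf)
grid = np.linspace(0.0, 15.0, 3001)
step = grid[1] - grid[0]

naive = kde(grid, data, h=0.3)
reflected = reflected_kde(grid, data, h=0.3)
naive_area = trapezoid_area(naive, step)            # below 1: mass leaked past 0
reflected_area = trapezoid_area(reflected, step)    # close to 1 again

# The log-transform estimate is only defined for x > 0, so skip grid[0] = 0.
log_based = log_transform_kde(grid[1:], data, h_log=0.3)
```

Reflection doubles the naive estimate exactly at the boundary and forces a zero slope there, which suits densities that are flat near 0 better than ones that rise or fall steeply.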

Multivariate KDE

For d-dimensional data, KDE generalises by replacing $h$ with a bandwidth matrix $H$ (a full d × d matrix, or a diagonal one when a product kernel is used). The curse of dimensionality bites hard: the best achievable MISE decays only at rate N^(−4/(d+4)), so for d >= 5 KDE is mostly impractical. For high-dimensional density estimation, use parametric mixtures (Gaussian mixture models) or modern alternatives (normalizing flows, autoregressive density estimators).
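To get a feel for the rate, the sketch below asks how many observations are needed in d dimensions to match the error rate that 100 observations give in one dimension. It ignores the constants in front of the rate, so the numbers are only a rough guide; the function names and the reference size 100 are assumptions:

```python
def rate_exponent(d):
    """Exponent r in the MISE rate N^(-r), with r = 4 / (d + 4)."""
    return 4.0 / (d + 4.0)

def matching_sample_size(d, n_ref=100, d_ref=1):
    """N such that N^(-4/(d+4)) equals n_ref^(-4/(d_ref+4)).

    Constants multiplying the rate are ignored (a deliberate simplification).
    """
    return n_ref ** (rate_exponent(d_ref) / rate_exponent(d))

# Required sample size for d = 1, ..., 8, rounded to the nearest integer.
table = {d: round(matching_sample_size(d)) for d in range(1, 9)}
```

Under this simplification the requirement grows as 100^((d+4)/5): about 10,000 observations at d = 6, compared with 100 at d = 1.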

Why KDE matters

KDE is the workhorse for:

Exploratory data analysis (visualise the data's shape without bin-edge artifacts).
Computing density-based statistics (the mode, or all modes for multimodal data).
Inputs to other procedures (e.g., density-ratio estimation, importance sampling).
Diagnostic plots in regression and time-series.
Non-parametric goodness-of-fit testing.
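As an illustration of the mode-finding use, the sketch below locates the local maxima of a Gaussian KDE on a grid. The function names, the simulated bimodal sample, the bandwidth h = 0.4, and the relative-height cut-off that discards tiny tail bumps are all hypothetical choices:

```python
import numpy as np

def kde(x_eval, data, h):
    """Gaussian KDE evaluated at x_eval."""
    u = (x_eval[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def kde_modes(data, h, grid, min_rel_height=0.1):
    """Grid points where the KDE has a local maximum of non-negligible height."""
    dens = kde(grid, data, h)
    interior = dens[1:-1]
    is_peak = (interior > dens[:-2]) & (interior > dens[2:])
    is_tall = interior >= min_rel_height * dens.max()
    return grid[1:-1][is_peak & is_tall]

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(2.0, 0.5, 200)])
grid = np.linspace(-5.0, 5.0, 1001)
modes = kde_modes(data, h=0.4, grid=grid)   # expect one mode near each component mean
```

The number of modes found depends on the bandwidth: a much smaller h would report spurious modes, and a much larger one would merge the two real ones.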

Try it

Defaults: N = 80, Silverman bandwidth. The KDE (red) is reasonable but oversmooths the valley between the two mixture peaks, as expected when a Normal-reference bandwidth meets bimodal data. The histogram (gray) gives a coarse, bin-edge-sensitive view; the rug (blue) shows individual data positions.
Click "Undersmooth (h = 0.15)". KDE becomes wiggly, following each individual data point — the high-variance regime. Bumps appear at each datum; bimodal structure is exaggerated.
Click "Oversmooth (h = 1.5)". KDE smooths over both modes into a single broad mound — high bias, multimodal structure is gone. This is what happens when bandwidth is chosen too large.
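Bump N to 300. With more data, Silverman's default bandwidth gets smaller (it is proportional to N^(-1/5), so going from N = 80 to N = 300 shrinks it by a factor of about 0.77 for the same scale estimate) and the KDE follows the true shape more closely. The integrated MSE shrinks at a rate of roughly N^(-4/5).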
Switch to Manual and explore: try h = 0.4 with N = 200, which resolves both peaks well. Lowering N while keeping the same h gives a noisier estimate. The Silverman bandwidth itself depends on N; watch how it shifts in the readout.

A data analyst wants to plot the distribution of household incomes (right-skewed, bounded at 0). Why might a histogram be misleading and how should they configure KDE for this case?

What you now know

KDE estimates a continuous density by placing kernels at each data point. Bandwidth controls bias-variance trade-off; Silverman's rule gives a Normal-target default, Sheather-Jones a better plug-in alternative. Boundary problems require reflection or transformation. Curse of dimensionality limits KDE to d <= 4 in practice. §8.6 closes Part 8 with QUANTILE REGRESSION — a distribution-free alternative for regression that estimates quantiles directly.

References

Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall. (The standard reference.)
Wand, M.P., Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall.
Sheather, S.J., Jones, M.C. (1991). "A reliable data-based bandwidth selection method for kernel density estimation." JRSS-B 53(3), 683–690.
Scott, D.W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization (2nd ed.). Wiley.
Hyndman, R.J., Bashtannyk, D.M., Grunwald, G.K. (1996). "Estimating and visualizing conditional densities." J. Comp. & Graph. Stat. 5(4), 315–336.