Regularisation: ridge, lasso, elastic net

Part 9 — Machine learning for researchers

Learning objectives

Define RIDGE regression as OLS + L2 penalty on coefficients
Define LASSO regression as OLS + L1 penalty, with the automatic FEATURE SELECTION property
Recognise ELASTIC NET as the weighted combination of L1 and L2
Apply cross-validation to choose the regularisation parameter λ
Recognise when regularisation HELPS (high-dim sparse, p > n, correlated predictors) and when it doesn't

OLS regression is unbiased but has high variance in high-dimensional or near-singular settings. REGULARISATION trades a small amount of bias for a large reduction in variance — typically improving out-of-sample prediction. The two foundational forms: RIDGE (L2 penalty) and LASSO (L1 penalty). The L1 case adds a powerful bonus: AUTOMATIC FEATURE SELECTION.

Ridge regression (Hoerl & Kennard 1970)

The ridge estimator minimises

\hat{\beta}_{\text{ridge}} = \arg\min_\beta \sum_{i=1}^N (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^P \beta_j^2.

The closed-form solution is

\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y.

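Adding $\lambda I$ to $X^T X$ makes the matrix invertible for any $\lambda > 0$, even when X is rank-deficient or P > N. Coefficients SHRINK toward zero but, in general, never reach it for any $\lambda < \infty$. The shrinkage is not uniform: along the j-th principal direction of X, with singular value $d_j$, the fit is scaled by $d_j^2 / (d_j^2 + \lambda)$, so directions in which the data carry little information (small $d_j$) are shrunk the most.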
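The closed form can be checked numerically. The following is a minimal NumPy sketch, not a production implementation: the function name `ridge_fit`, the simulated data, and the two λ values are illustrative assumptions, and the intercept is omitted for brevity.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam*I)^{-1} X'y, solved without forming the inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 20, 50                      # more predictors than observations
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + rng.standard_normal(n)

# X'X has rank at most n < p, so plain OLS has no unique solution here
rank_xtx = np.linalg.matrix_rank(X.T @ X)

beta_small = ridge_fit(X, y, lam=0.1)
beta_large = ridge_fit(X, y, lam=100.0)

print("rank of X'X:", rank_xtx, "out of", p)
print("coefficient norm, lam=0.1:", np.linalg.norm(beta_small))
print("coefficient norm, lam=100:", np.linalg.norm(beta_large))
print("exact zeros at lam=100:", int(np.sum(beta_large == 0.0)))
```

The system is solvable even though P > N, the coefficient norm falls as λ grows, and no coefficient is set exactly to zero.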

Lasso (Tibshirani 1996)

The lasso estimator minimises

\hat{\beta}_{\text{lasso}} = \arg\min_\beta \sum_{i=1}^N (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^P |\beta_j|.

The L1 penalty has a SPECIAL GEOMETRIC PROPERTY: the constraint region $\sum |\beta_j| \le t$ has corners on the axes. The optimal solution often lies AT such a corner, making some coefficients exactly zero. Lasso both SHRINKS and SELECTS, which amounts to automatic feature selection. There is no closed-form solution in general; standard algorithms are coordinate descent (glmnet's default), LARS (least angle regression), proximal gradient, and ADMM.
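Coordinate descent for the lasso objective above can be sketched in a few lines. This is an illustrative NumPy version under simplifying assumptions (no intercept, a fixed number of sweeps instead of a convergence check); the names `soft_threshold` and `lasso_cd`, the simulated data, and λ = 60 are not from the text. Because the objective has no factor of 1/2 on the squared error, the threshold in each coordinate update is λ/2.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding: move z toward zero by t, and return zero if |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for sum (y - X b)^2 + lam * sum |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)     # x_j' x_j for each column
    resid = y - X @ beta                # current full residual
    for _ in range(n_sweeps):
        for j in range(p):
            # inner product of column j with the residual that excludes column j
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)   # keep the residual in sync
            beta[j] = new_bj
    return beta

rng = np.random.default_rng(0)
n, p = 100, 10
beta_true = np.array([2.0, 1.0, -1.5] + [0.0] * 7)   # three signals, seven noise predictors
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=60.0)
print(np.round(beta_hat, 3))
```

The soft-thresholding step is where exact zeros come from: any coordinate whose inner product with the partial residual is at most λ/2 in absolute value is set to zero rather than merely shrunk.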

The L1 vs L2 difference

The geometric intuition: imagine the OLS contours (ellipses around the OLS solution) expanding until they first touch a constraint set. For L2 (ridge): the constraint set is a sphere with no corners, so the contact point is generically a point on its surface where every coordinate is non-zero. All coefficients are smaller, but none is zero. For L1 (lasso): the constraint set has SHARP CORNERS on the axes; the contact point tends to be at a corner, where one or more coefficients are zero.

Statistically: lasso's sparse solutions are good when the true coefficient vector is SPARSE (most predictors are irrelevant); ridge is better when ALL coefficients are non-zero but small. In real data, sparsity often holds — most predictors don't matter — making lasso's feature-selection property genuinely useful.
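
A worked special case makes the contrast concrete. Assume, purely for illustration, an orthonormal design, $X^T X = I$, and write $\hat{\beta}^{\text{OLS}} = X^T y$. Both objectives above then separate coordinate by coordinate, and the minimisers are

\hat{\beta}_{j,\text{ridge}} = \frac{\hat{\beta}_j^{\text{OLS}}}{1 + \lambda}, \qquad \hat{\beta}_{j,\text{lasso}} = \operatorname{sign}(\hat{\beta}_j^{\text{OLS}}) \left( |\hat{\beta}_j^{\text{OLS}}| - \frac{\lambda}{2} \right)_+ .

Ridge rescales every coefficient by the same factor, so a non-zero OLS coefficient stays non-zero. Lasso subtracts a fixed amount $\lambda/2$ and sets to exactly zero every coefficient with $|\hat{\beta}_j^{\text{OLS}}| \le \lambda/2$. The factor 1/2 arises because the objectives here carry no 1/2 in front of the squared error.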

Elastic net (Zou & Hastie 2005)

Combines L1 and L2:

\hat{\beta}_{\text{EN}} = \arg\min_\beta \sum_{i=1}^N (y_i - x_i^T \beta)^2 + \lambda_1 \sum_{j=1}^P |\beta_j| + \lambda_2 \sum_{j=1}^P \beta_j^2.

Inherits lasso's feature selection (L1 term) AND ridge's grouping property (L2 term groups correlated predictors so they're selected together). Useful when predictors are highly correlated and lasso would otherwise pick one arbitrarily from each correlated cluster. Default in many modern applications.
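
The grouping property can be seen in a small experiment. The sketch below extends coordinate descent to the elastic net objective above and fits a design in which two columns are exact duplicates. The function name `elastic_net_cd`, the simulated data, and the values of λ₁ and λ₂ are illustrative assumptions; the intercept is omitted and a fixed number of sweeps replaces a convergence check.

```python
import numpy as np

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=500):
    """Cyclic coordinate descent for
    sum (y - X b)^2 + lam1 * sum |b_j| + lam2 * sum b_j^2."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    resid = y.copy()                    # residual for beta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            # L1 part: soft-threshold at lam1 / 2; L2 part: extra lam2 in the denominator
            shrunk = np.sign(rho) * max(abs(rho) - lam1 / 2.0, 0.0)
            new_bj = shrunk / (col_sq[j] + lam2)
            resid += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

rng = np.random.default_rng(1)
n = 100
x = rng.standard_normal(n)
noise_col = rng.standard_normal(n)
X = np.column_stack([x, x, noise_col])   # columns 0 and 1 are exact duplicates
y = 3.0 * x + rng.standard_normal(n)

b_lasso = elastic_net_cd(X, y, lam1=20.0, lam2=0.0)    # lam2 = 0 is the plain lasso
b_enet = elastic_net_cd(X, y, lam1=20.0, lam2=50.0)
print("lasso      :", np.round(b_lasso, 3))
print("elastic net:", np.round(b_enet, 3))
```

With λ₂ = 0 the solver puts all the weight on whichever duplicate it updates first, which is the arbitrary choice described above. With λ₂ > 0 the objective is strictly convex, the solution is unique, and the two duplicates receive equal coefficients.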

Choosing λ via cross-validation

The standard procedure:

Choose a grid of $\lambda$ values, log-spaced from very small (≈ OLS) to very large (≈ everything shrunk to zero).
For each λ, fit the model and compute k-fold CV error.
Pick the λ that minimises CV error, OR (by the 1-SE rule) the largest λ whose CV error is within one standard error of the minimum.

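The 1-SE rule biases toward simpler (more regularised) models, which are often preferred for interpretability and for robustness to small CV-error differences. R's cv.glmnet automates this, reporting both choices as lambda.min and lambda.1se. In Python, scikit-learn provides LassoCV, RidgeCV, and ElasticNetCV, which select the CV-error minimiser.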
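
The three steps, including the 1-SE rule, can be written out by hand. The sketch below uses ridge regression because its closed form keeps the code short; the same loop applies to lasso or elastic net with a different fitting function. The helper names `ridge_fit` and `cv_curve`, the simulated data, the grid, and k = 5 are illustrative assumptions, and the intercept is omitted.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate for a given lambda."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_curve(X, y, lambdas, k=5, seed=0):
    """Mean and standard error of the k-fold CV mean squared error, per lambda."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errs = np.empty((len(lambdas), k))
    for f, test_idx in enumerate(folds):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        for i, lam in enumerate(lambdas):
            beta = ridge_fit(X[train], y[train], lam)
            errs[i, f] = np.mean((y[test_idx] - X[test_idx] @ beta) ** 2)
    return errs.mean(axis=1), errs.std(axis=1, ddof=1) / np.sqrt(k)

rng = np.random.default_rng(42)
n, p = 60, 30
X = rng.standard_normal((n, p))
beta_true = np.concatenate([[2.0, 1.0, -1.5], np.zeros(p - 3)])
y = X @ beta_true + rng.standard_normal(n)

lambdas = np.logspace(-3, 3, 25)             # step 1: log-spaced grid
mean_err, se_err = cv_curve(X, y, lambdas)   # step 2: k-fold CV error for each lambda

i_min = int(np.argmin(mean_err))             # step 3a: minimiser of CV error
lam_min = lambdas[i_min]
# step 3b: largest lambda whose CV error is within one SE of the minimum
within = mean_err <= mean_err[i_min] + se_err[i_min]
lam_1se = lambdas[within].max()

print("lambda minimising CV error:", lam_min)
print("lambda chosen by 1-SE rule:", lam_1se)
```

By construction the 1-SE choice is never smaller than the minimiser, so it never selects a less regularised model.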

Beyond linear regression

L1/L2 penalties generalise:

Logistic regression: lasso/ridge logistic regression for classification with feature selection.
GLMs: penalised Poisson, NB, gamma regressions — same penalty terms added to the log-likelihood.
Survival models: penalised Cox regression.
Multinomial: lasso multinomial logistic for multi-class problems.
Neural networks: weight decay = L2 regularisation; dropout = stochastic regularisation.
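
The weight-decay equivalence in the last item is easy to verify for plain gradient descent. The sketch below fits an L2-penalised logistic regression two ways: by adding the gradient of (λ/2)‖w‖² to the loss gradient, and by shrinking the weights by the factor (1 − ηλ) at each step. The data, the learning rate η (`lr`), and λ (`lam`) are illustrative assumptions. The equivalence shown holds for plain gradient descent with this penalty scaling; adaptive optimisers treat the two forms differently.

```python
import numpy as np

def logistic_grad(w, X, y):
    """Gradient of the mean negative log-likelihood for labels y in {0, 1}."""
    p_hat = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p_hat - y) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(200) > 0).astype(float)

lam, lr = 0.1, 0.5
w_penalty = np.zeros(5)   # gradient descent on loss + (lam / 2) * ||w||^2
w_decay = np.zeros(5)     # weights scaled by (1 - lr * lam), minus the unpenalised gradient step
for _ in range(300):
    w_penalty = w_penalty - lr * (logistic_grad(w_penalty, X, y) + lam * w_penalty)
    w_decay = (1.0 - lr * lam) * w_decay - lr * logistic_grad(w_decay, X, y)

print(np.round(w_penalty, 4))
print(np.round(w_decay, 4))
```

The two weight vectors agree up to floating-point rounding, since the two update rules are algebraically the same.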

Statistical properties

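Under regularity conditions on the design, lasso has strong theoretical guarantees (Knight & Fu 2000, Donoho 2006, Bickel-Ritov-Tsybakov 2009): with the right λ, its estimation and prediction error come close to those of an oracle that knew in advance which predictors were irrelevant, and under additional conditions it identifies the true non-zero coefficients with probability → 1. The plain lasso does not in general attain the full ORACLE PROPERTY, because the L1 penalty also biases the large coefficients toward zero. Refinements such as the adaptive lasso, the non-convex penalties SCAD and MCP, and the de-biased lasso address this bias.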

Try it

Start in Lasso mode, λ = 1 (log₁₀(λ) = 0). The three true coefficients (β₁=2, β₂=1, β₃=-1.5) are clearly visible as the thicker lines; the seven noise coefficients (thinner lines) are mostly near zero already.
Drag λ to larger values. Watch the lasso paths: as λ grows, more and more coefficients are zeroed out. At very large λ, even the true coefficients are zeroed (over-regularisation). Lasso is BOTH shrinking AND selecting.
Switch to Ridge mode. Watch the same λ-sweep: all coefficients shrink smoothly toward zero, but none is EXACTLY zero, even at large λ. The two methods have fundamentally different behaviour.
Re-sample several times. The noise coefficients vary across re-samples (high variance). Lasso's feature-selection is somewhat stable but not perfectly so; ridge's shrinkage is monotone.
At λ around 0.1-0.3 in Lasso mode, you should see only 3-4 non-zero coefficients, with the true coefficients well estimated. This is the sweet spot for sparsity recovery, and the region that cross-validation aims to find.
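
The same experiment can be approximated offline. The sketch below simulates three true coefficients (2, 1, -1.5) plus seven noise predictors and counts the non-zero coefficients along a λ sweep for both methods. The sample size, noise level, and λ grid are assumptions; in particular, λ here is on the scale of the unnormalised objectives given earlier, which need not match the scale of the interactive slider. The helper `lasso_cd` is an illustrative warm-started coordinate descent, not the demo's own solver.

```python
import numpy as np

def lasso_cd(X, y, lam, beta_init, n_sweeps=100):
    """Warm-started cyclic coordinate descent for sum (y - X b)^2 + lam * sum |b_j|."""
    beta = beta_init.copy()
    col_sq = np.sum(X ** 2, axis=0)
    resid = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new_bj = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

rng = np.random.default_rng(7)
n, p = 100, 10
beta_true = np.array([2.0, 1.0, -1.5] + [0.0] * 7)
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

lambdas = np.logspace(0, 3, 7)
lasso_nonzero, ridge_nonzero = [], []
beta = np.zeros(p)
for lam in lambdas:
    beta = lasso_cd(X, y, lam, beta)     # reuse the previous solution as a warm start
    ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    lasso_nonzero.append(int(np.count_nonzero(beta)))
    ridge_nonzero.append(int(np.count_nonzero(ridge)))

for lam, k_l, k_r in zip(lambdas, lasso_nonzero, ridge_nonzero):
    print(f"lambda={lam:8.1f}  lasso non-zero: {k_l:2d}  ridge non-zero: {k_r:2d}")
```

The lasso count drops to zero at the largest λ, while the ridge count stays at ten throughout.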

A scientist has 100 predictors but suspects only 5-10 truly matter. They have N = 200 observations. Which is the natural choice: ridge, lasso, or elastic net, and why?

What you now know

Ridge shrinks all coefficients; lasso shrinks AND selects (sparse solutions); elastic net combines both. Choose λ via cross-validation, with the 1-SE rule as a practical default. The L1 penalty's geometric property (corners on the axes) yields automatic feature selection, one of the most useful tools in modern ML for high-dimensional sparse problems. §9.3 next: trees, random forests, and gradient boosting, the other workhorse of modern ML.

References

Hoerl, A.E., Kennard, R.W. (1970). "Ridge regression: Biased estimation for nonorthogonal problems." Technometrics 12(1), 55–67.
Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso." JRSS-B 58(1), 267–288. (The lasso paper.)
Zou, H., Hastie, T. (2005). "Regularization and variable selection via the elastic net." JRSS-B 67(2), 301–320.
Hastie, T., Tibshirani, R., Wainwright, M. (2015). Statistical Learning with Sparsity. CRC. (Comprehensive modern reference.)
Bickel, P.J., Ritov, Y., Tsybakov, A.B. (2009). "Simultaneous analysis of lasso and Dantzig selector." Annals of Statistics 37(4), 1705–1732.