Causal forests and double ML

Part 9 — Machine learning for researchers

Learning objectives

State the DOUBLE/DEBIASED ML framework (Chernozhukov et al. 2017)
Recognise cross-fitting as the essential ingredient that combines ML flexibility with valid inference
Apply DOUBLE ML to partially-linear models: τ from ML nuisances + linear residual regression
Introduce CAUSAL FORESTS (Athey-Wager 2019) for heterogeneous treatment effects
Recognise modern applied econometrics tools: doubleml, EconML, grf

The chapter has covered ML for prediction (§§9.1–9.5). The natural question: can ML also help with causal inference? In high-dimensional settings (10+ confounders), classical OLS adjustment requires strong functional-form assumptions about how X affects Y. Modern hybrid methods — DOUBLE ML (Chernozhukov et al. 2017) and CAUSAL FORESTS (Athey-Wager 2019) — bring ML's flexibility to causal inference WITHOUT sacrificing valid inference.

The setup

Consider the partially linear model:

Y = \tau T + g(X) + \varepsilon_Y, \quad T = m(X) + \varepsilon_T,

where $T$ is the treatment, $\tau$ the (homogeneous) causal effect, $X$ a (possibly high-dimensional) vector of confounders, and $g, m$ are unknown smooth functions. Classical approaches use parametric specifications of g (e.g., linear in X). When X is high-dimensional or the relationships are nonlinear, a misspecified g leaves part of the confounding in the error term and biases the estimate of $\tau$.
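To make the failure concrete, here is a minimal simulation sketch. The quadratic choices for g and m, the sample size, and the seed are illustrative assumptions, not part of the chapter. It draws data from the partially linear model and fits the classical specification that treats g as linear in X.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 5000, 1.0

# Hypothetical nonlinear nuisance functions (illustration only).
def g(x):
    return x ** 2

def m(x):
    return x ** 2

X = rng.normal(size=n)
T = m(X) + rng.normal(size=n)             # T = m(X) + eps_T
Y = tau * T + g(X) + rng.normal(size=n)   # Y = tau*T + g(X) + eps_Y

# Classical adjustment: OLS of Y on [1, T, X], i.e. g is assumed linear in X.
design = np.column_stack([np.ones(n), T, X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
tau_linear = coef[1]

print(f"true tau = {tau:.2f}, linear-adjustment estimate = {tau_linear:.2f}")
```

Under these assumed functions the linear term in X captures none of the $X^2$ confounding, so the coefficient on T settles near $\tau + 2/3$ rather than $\tau$, however large n becomes.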

The double ML idea

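The key observation (Robinson 1988, made modern by Chernozhukov et al. 2017): if we know the two conditional means $m(X) = E[T \mid X]$ and $\ell(X) = E[Y \mid X] = \tau\, m(X) + g(X)$, we can FRISCH-WAUGH the system by subtracting $E[Y \mid X]$ from both sides of the outcome equation:

Y - \ell(X) = \tau (T - m(X)) + \varepsilon_Y.

Equivalently: take residuals of Y on X and T on X; regress Y-residuals on T-residuals; the slope IS $\tau$. The trick: ESTIMATE the two conditional means via ANY ML method (random forest, gradient boosting, neural net, lasso), then compute residuals, then run a linear regression. In the rest of this section, $\hat{g}$ denotes the fitted outcome nuisance (here the estimate of $E[Y \mid X]$) and $\hat{m}$ the estimate of $E[T \mid X]$.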
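The residual-on-residual idea can be checked numerically in the "oracle" case where the conditional means are known. The sketch below assumes hypothetical functions g and m chosen only for illustration. The outcome residual is taken around $E[Y \mid X] = \tau\, m(X) + g(X)$, which is what "residual of Y on X" means under the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 5000, 1.0

# Hypothetical nuisance functions (illustration only).
def g(x):
    return np.sin(x)

def m(x):
    return x ** 2

X = rng.normal(size=n)
T = m(X) + rng.normal(size=n)
Y = tau * T + g(X) + rng.normal(size=n)

# Oracle conditional means implied by the model.
ell = tau * m(X) + g(X)   # E[Y | X]
Y_res = Y - ell           # residual of Y on X
T_res = T - m(X)          # residual of T on X

# Slope of the no-intercept regression of Y-residuals on T-residuals.
tau_oracle = (T_res @ Y_res) / (T_res @ T_res)
print(f"oracle residual-on-residual estimate: {tau_oracle:.3f}")
```

In practice the oracle means are unavailable; the point of the sketch is only that the slope of the residual regression targets $\tau$ once X has been partialled out of both variables.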

Cross-fitting: the essential ingredient

Naively plugging ML estimates $\hat{g}$ and $\hat{m}$ into the residuals introduces overfitting BIAS: when the nuisances are fitted on the same observations used in the residual regression, their estimation errors are correlated with the regression errors. Chernozhukov et al. solve this via CROSS-FITTING:

Split data into K folds.
For each fold, train $\hat{g}$ and $\hat{m}$ on the OTHER K-1 folds.
Use $\hat{g}, \hat{m}$ to compute residuals on this held-out fold.
Pool all out-of-fold residuals, do the final linear regression.
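The four steps above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the implementation of any package. The names `poly_features`, `ridge_fit_predict`, and `dml_plr` are hypothetical, the nuisance learner is ridge regression on polynomial features of a single covariate, and the standard error is the usual heteroskedasticity-robust sandwich for the final residual regression.

```python
import numpy as np

def poly_features(x, degree=5):
    """Columns 1, x, ..., x^degree for a 1-D covariate."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit_predict(F_train, y_train, F_test, alpha=1.0):
    """Closed-form ridge; for simplicity the intercept column is penalised too."""
    p = F_train.shape[1]
    beta = np.linalg.solve(F_train.T @ F_train + alpha * np.eye(p),
                           F_train.T @ y_train)
    return F_test @ beta

def dml_plr(Y, T, X, K=5, seed=0):
    """Cross-fitted residual-on-residual estimate of tau with a robust SE."""
    n = len(Y)
    F = poly_features(X)
    folds = np.random.default_rng(seed).permutation(n) % K  # step 1: K folds
    Y_res, T_res = np.empty(n), np.empty(n)
    for k in range(K):
        test = folds == k
        train = ~test
        # steps 2-3: fit on the other K-1 folds, residualise the held-out fold
        Y_res[test] = Y[test] - ridge_fit_predict(F[train], Y[train], F[test])
        T_res[test] = T[test] - ridge_fit_predict(F[train], T[train], F[test])
    # step 4: pooled final regression of Y-residuals on T-residuals
    tau_hat = (T_res @ Y_res) / (T_res @ T_res)
    eps = Y_res - tau_hat * T_res
    se = np.sqrt(np.sum(T_res**2 * eps**2)) / np.sum(T_res**2)
    return tau_hat, se

# Usage on simulated data (hypothetical g and m).
rng = np.random.default_rng(42)
n, tau = 2000, 1.0
X = rng.uniform(-2, 2, size=n)
T = X**2 + rng.normal(size=n)
Y = tau * T + np.sin(2 * X) + rng.normal(size=n)

tau_hat, se = dml_plr(Y, T, X)
print(f"tau_hat = {tau_hat:.3f}, 95% CI = "
      f"[{tau_hat - 1.96 * se:.3f}, {tau_hat + 1.96 * se:.3f}]")
```

Every residual is computed from a model that never saw that observation, which is the property the cross-fitting steps are designed to guarantee. Swapping in a richer learner only changes `ridge_fit_predict`.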

Cross-fitting separates the data used for nuisance estimation from the data used for the final regression, which removes the overfitting bias. NEYMAN ORTHOGONALITY of the moment condition ensures that small ML errors in $\hat{g}, \hat{m}$ affect $\hat{\tau}$ only at second order, so they don't damage its $\sqrt{n}$ asymptotic distribution. Result: VALID CIs with no further inflation, even when the nuisances are estimated by black-box ML.

The big result

Theorem (Chernozhukov et al. 2017, simplified): if $\hat{g}$ and $\hat{m}$ each converge faster than $n^{-1/4}$ (a much weaker requirement than the parametric $n^{-1/2}$ rate, and one that many ML methods can meet under suitable conditions), then $\hat{\tau}_{\text{DML}}$ is $\sqrt{n}$-consistent and asymptotically Normal, with valid CIs computed by the standard formula. ML's flexibility + classical inference's rigor.
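In symbols, a simplified version of the statement for the residual-on-residual estimator reads as follows. This is a sketch in the chapter's notation, with $\varepsilon_T = T - m(X)$; the regularity conditions of the original paper are omitted.

```latex
\|\hat{g} - g\|_{2} = o_P\!\left(n^{-1/4}\right), \qquad
\|\hat{m} - m\|_{2} = o_P\!\left(n^{-1/4}\right)
\;\;\Longrightarrow\;\;
\sqrt{n}\,\bigl(\hat{\tau}_{\text{DML}} - \tau\bigr) \xrightarrow{\;d\;} N(0, \sigma^2),
\qquad
\sigma^2 = \frac{E\!\left[\varepsilon_T^2\,\varepsilon_Y^2\right]}{\bigl(E\!\left[\varepsilon_T^2\right]\bigr)^2}.
```

The "standard formula" for a 95% CI is then $\hat{\tau}_{\text{DML}} \pm 1.96\,\hat{\sigma}/\sqrt{n}$, where $\hat{\sigma}^2$ replaces the expectations with sample averages of the cross-fitted residuals.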

Causal forests (Athey-Wager 2019)

For HETEROGENEOUS treatment effects $\tau(x) = E[Y(1) - Y(0) \mid X = x]$ that vary across individuals, single estimators of average τ are insufficient. CAUSAL FORESTS extend random forests:

Build trees whose splits maximise treatment-effect heterogeneity (not classification or regression).
Each tree estimates a local treatment effect within its leaves.
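Average across the forest to estimate $\tau(x)$.
Honest estimation (one subsample chooses the splits, a separate one estimates the leaf effects) makes the estimates asymptotically Normal; the variance is estimated from the forest's own subsamples (e.g., the infinitesimal jackknife in Wager-Athey 2018), so valid pointwise CIs are available.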

Causal forests are now a widely used tool for heterogeneous treatment effects in applied econometrics. R package grf (Athey, Tibshirani, Wager); Python EconML and causalml.
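The splitting idea can be illustrated with a deliberately tiny toy: a single split chosen to maximise treatment-effect heterogeneity, with "honest" leaf estimates computed on a separate half of the data. This is not the grf algorithm (no forest, no subsampling, no orthogonalisation); the randomised treatment, the step-function $\tau(x)$, and the helper names are all assumptions made for the illustration.

```python
import numpy as np

def diff_in_means(Y, W):
    """Treatment effect estimate in a randomised sample."""
    return Y[W == 1].mean() - Y[W == 0].mean()

def best_split(X, Y, W, candidates, min_arm=20):
    """Threshold whose children differ most in estimated treatment effect."""
    best_c, best_score = None, -np.inf
    for c in candidates:
        left = X <= c
        right = ~left
        arms = [np.sum(W[left] == 1), np.sum(W[left] == 0),
                np.sum(W[right] == 1), np.sum(W[right] == 0)]
        if min(arms) < min_arm:   # need treated and control units on both sides
            continue
        gap = diff_in_means(Y[left], W[left]) - diff_in_means(Y[right], W[right])
        score = left.mean() * right.mean() * gap ** 2
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Simulated experiment: effect is 0 for x <= 0.5 and 2 for x > 0.5.
rng = np.random.default_rng(7)
n = 4000
X = rng.uniform(0, 1, size=n)
W = rng.integers(0, 2, size=n)            # randomised binary treatment
tau_x = np.where(X > 0.5, 2.0, 0.0)
Y = tau_x * W + rng.normal(size=n)

# Honesty: the first half picks the split, the second half estimates effects.
half = n // 2
split_idx, est_idx = np.arange(half), np.arange(half, n)
c = best_split(X[split_idx], Y[split_idx], W[split_idx],
               np.linspace(0.1, 0.9, 17))

left = X[est_idx] <= c
tau_left = diff_in_means(Y[est_idx][left], W[est_idx][left])
tau_right = diff_in_means(Y[est_idx][~left], W[est_idx][~left])
print(f"split at x = {c:.2f}: tau_left = {tau_left:.2f}, tau_right = {tau_right:.2f}")
```

A real causal forest repeats this kind of split recursively over many subsampled trees and averages the results; for applied work the packages named above are the appropriate tools.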

Other modern causal ML methods

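R-Learner (Nie-Wager 2021): extends the residual-on-residual idea behind DML to heterogeneous effects $\tau(x)$; works with flexible ML models for both the nuisances and $\tau(x)$.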
X-Learner (Künzel et al. 2019): designed for highly imbalanced treatment groups.
TMLE (Targeted Maximum Likelihood Estimation; van der Laan): semi-parametric efficient estimation with ML nuisances.
BART (Bayesian additive regression trees): tree-based Bayesian nonparametrics that adapt naturally to causal inference (e.g., the bartCause R package).
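As one concrete instance, the R-Learner objective can be written in this chapter's notation as below. It is a sketch of the loss from Nie and Wager (2021), with $\hat{\ell}(x)$ an estimate of $E[Y \mid X = x]$, $\hat{m}(x)$ an estimate of $E[T \mid X = x]$ (both cross-fitted), and $\Lambda_n$ a regulariser on the complexity of $\tau(\cdot)$.

```latex
\hat{\tau}(\cdot) = \arg\min_{\tau}
\left\{ \frac{1}{n} \sum_{i=1}^{n}
\Bigl[ \bigl(Y_i - \hat{\ell}(X_i)\bigr) - \bigl(T_i - \hat{m}(X_i)\bigr)\,\tau(X_i) \Bigr]^2
+ \Lambda_n\bigl(\tau(\cdot)\bigr) \right\}
```

Restricting $\tau(\cdot)$ to a constant and dropping the penalty recovers the residual-on-residual regression of double ML.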

What this DOESN'T solve

Modern causal ML still requires the IDENTIFICATION assumptions of §6:

No unobserved confounding: X must include ALL common causes. Double ML cannot fix the absence of an unmeasured confounder.
Positivity: P(T=1|X) bounded away from 0 and 1.
Stable unit treatment values (SUTVA): no spillover between units, and a single well-defined version of the treatment.

What modern ML CAN do: relax the FUNCTIONAL FORM assumption. You no longer need to commit to a particular parametric form for g(X). What ML CANNOT do: substitute for unconfoundedness.
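A short simulation makes the limit concrete. The sketch below assumes a hypothetical unobserved confounder U that drives both T and Y, and runs a cross-fitted residual-on-residual estimate using only the observed X (with linear nuisance fits, which are correctly specified here, so functional form is not the problem).

```python
import numpy as np

rng = np.random.default_rng(3)
n, tau = 20000, 1.0

X = rng.normal(size=n)   # observed confounder
U = rng.normal(size=n)   # UNOBSERVED confounder (hypothetical)
T = X + U + rng.normal(size=n)
Y = tau * T + X + U + rng.normal(size=n)

def cross_fit_residuals(target, X, K=5, seed=0):
    """Out-of-fold residuals from a linear fit of target on [1, X]."""
    n = len(target)
    folds = np.random.default_rng(seed).permutation(n) % K
    F = np.column_stack([np.ones(n), X])
    res = np.empty(n)
    for k in range(K):
        test = folds == k
        beta, *_ = np.linalg.lstsq(F[~test], target[~test], rcond=None)
        res[test] = target[test] - F[test] @ beta
    return res

# The analyst only sees X, so U stays in both residuals.
Y_res = cross_fit_residuals(Y, X)
T_res = cross_fit_residuals(T, X)
tau_hat = (T_res @ Y_res) / (T_res @ T_res)
print(f"true tau = {tau:.2f}, estimate ignoring U = {tau_hat:.2f}")
```

With these assumed unit variances the estimate converges to $\tau + 1/2$, not $\tau$: cross-fitting and orthogonality remove estimation bias from the nuisances, but not bias from a confounder that was never measured.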

Try it

Default: τ = 1.0, N = 400. The naive OLS (Y ~ T only) is heavily biased — confounding by X dominates. Full OLS (Y ~ T + X) works in this LINEAR setting; it's the gold standard when functional form is known. DML recovers τ similarly with valid CI.
Drag N up to 2000. Both full OLS and DML converge to the true τ with shrinking CIs. DML's CI width shrinks at the parametric 1/√n rate, which Neyman orthogonality and cross-fitting make possible.
Re-sample many times. The naive OLS is consistently biased; DML CIs cover the true τ ~95% of the time (valid frequentist coverage). The DML CI is the inferential statement.
Set true τ = 0. DML estimates near zero with a CI that includes 0, so it correctly fails to detect a non-existent effect. Naive OLS still reports a non-zero estimate, which is entirely confounding bias.
The widget uses ridge as the nuisance estimator (a simple ML model). In real applications, use random forest, gradient boosting, or neural networks for richer functional forms.
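For readers without access to the widget, the experiment can be approximated offline. The sketch below is an assumption-laden stand-in, not the widget's code: the linear data-generating process with three confounders, the coefficient vectors `A` and `B`, and all function names are hypothetical. It compares naive OLS with cross-fitted DML using ridge nuisances and records how often the DML 95% CI covers the true τ across repeated samples.

```python
import numpy as np

A = np.array([1.0, 1.0, 1.0])   # hypothetical effect of X on T
B = np.array([1.0, 1.0, 1.0])   # hypothetical effect of X on Y

def simulate(n, tau, rng):
    X = rng.normal(size=(n, 3))
    T = X @ A + rng.normal(size=n)
    Y = tau * T + X @ B + rng.normal(size=n)
    return Y, T, X

def ridge_predict(X_tr, y_tr, X_te, alpha=1.0):
    """Ridge with an unpenalised intercept (data are centred first)."""
    x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()
    Xc = X_tr - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]),
                           Xc.T @ (y_tr - y_mean))
    return y_mean + (X_te - x_mean) @ beta

def dml(Y, T, X, rng, K=5):
    """Cross-fitted residual-on-residual estimate with a robust SE."""
    n = len(Y)
    folds = rng.permutation(n) % K
    Y_res, T_res = np.empty(n), np.empty(n)
    for k in range(K):
        te = folds == k
        Y_res[te] = Y[te] - ridge_predict(X[~te], Y[~te], X[te])
        T_res[te] = T[te] - ridge_predict(X[~te], T[~te], X[te])
    tau_hat = (T_res @ Y_res) / (T_res @ T_res)
    eps = Y_res - tau_hat * T_res
    se = np.sqrt(np.sum(T_res**2 * eps**2)) / np.sum(T_res**2)
    return tau_hat, se

def naive_ols(Y, T):
    """Slope of Y on T alone, ignoring X."""
    Tc = T - T.mean()
    return (Tc @ (Y - Y.mean())) / (Tc @ Tc)

rng = np.random.default_rng(2024)
tau, n, reps = 1.0, 400, 200
naive, covered = [], []
for _ in range(reps):
    Y, T, X = simulate(n, tau, rng)
    naive.append(naive_ols(Y, T))
    tau_hat, se = dml(Y, T, X, rng)
    covered.append(abs(tau_hat - tau) <= 1.96 * se)

naive_mean = float(np.mean(naive))
coverage = float(np.mean(covered))
print(f"mean naive OLS estimate: {naive_mean:.2f} (true tau = {tau})")
print(f"DML 95% CI coverage over {reps} samples: {coverage:.2f}")
```

Under this assumed process the naive slope centres on roughly $\tau + 0.75$, while the DML coverage should land close to the nominal 95%, up to Monte Carlo noise from the 200 replications.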

An economist has 50 confounders X, suspects nonlinear and interacted effects of X on Y, and wants to estimate the average treatment effect of T on Y. What modern approach is appropriate, and why isn't classical OLS sufficient?

What you now know

Double ML uses cross-fitted ML estimates of nuisance functions to compute debiased treatment effects with valid CIs. Causal forests extend random forests to heterogeneous treatment effects. Modern tools: grf, doubleml, EconML, BART. All still require identification assumptions (unconfoundedness, positivity, SUTVA). §9.7 next: reporting an ML result so the reader can trust it.

References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., Robins, J. (2017). "Double/debiased machine learning for treatment and structural parameters." The Econometrics Journal 21(1), C1–C68.
Wager, S., Athey, S. (2018). "Estimation and inference of heterogeneous treatment effects using random forests." JASA 113(523), 1228–1242.
Athey, S., Wager, S. (2019). "Estimating treatment effects with causal forests: An application." Observational Studies 5, 37–51.
Robinson, P.M. (1988). "Root-N-consistent semiparametric regression." Econometrica 56(4), 931–954.
Nie, X., Wager, S. (2021). "Quasi-oracle estimation of heterogeneous treatment effects." Biometrika 108(2), 299–319.