Causal warnings: regression is not causation

Part 4 — Linear regression, done seriously

Learning objectives

Distinguish prediction from causal estimation in linear regression
Recognise that, without additional structure, regression coefficients are partial associations (conditional-expectation slopes), not causal effects
Apply the DAG framework (preview Part 6) to think about confounding
Identify the three classical threats to causal inference: confounding, reverse causation, selection
Set realistic expectations for what causal claims a regression can support

Part 4 has built a sophisticated regression toolkit: OLS geometry, assumption diagnostics, robust SEs, robust regression, interactions, model selection. None of this gives you CAUSAL inference for free. §4.8 closes Part 4 with the honest warning: regression coefficients measure ASSOCIATIONS, not effects. Causal claims require additional assumptions about the data-generating process — addressed properly in Part 6 (Causal Inference).

What a regression coefficient actually is

When the linear model for $E[Y \mid X]$ is correctly specified, the coefficient $\beta_j$ is the slope of the conditional expectation with respect to $x_j$:

$$\beta_j = \frac{\partial E[Y \mid X]}{\partial x_j}, \quad \text{holding all other } x_k \text{ fixed}.$$

This is a STATEMENT ABOUT THE CONDITIONAL EXPECTATION. It tells you how, in the population, Y differs on average between groups that have different values of $x_j$ but the same values of the OTHER X variables. It does NOT tell you what would happen if you EXPERIMENTALLY CHANGED $x_j$.
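The "difference between groups" reading is easiest to see with a single binary regressor, where the OLS slope equals the difference in group means exactly. The sketch below uses made-up data (the variable names and the numbers 2.0 and 1.5 are illustrative assumptions, not taken from this course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a binary regressor x and an outcome y.
n = 1_000
x = rng.integers(0, 2, size=n)           # group label, 0 or 1
y = 2.0 + 1.5 * x + rng.normal(size=n)   # illustrative numbers only

# OLS of y on [1, x] via least squares.
design = np.column_stack([np.ones(n), x])
intercept, slope = np.linalg.lstsq(design, y, rcond=None)[0]

# Difference in group means, computed directly from the data.
mean_gap = y[x == 1].mean() - y[x == 0].mean()

print(f"OLS slope:           {slope:.4f}")
print(f"difference in means: {mean_gap:.4f}")
```

The two printed numbers agree up to floating-point error. The identity holds whatever the reason the two groups differ, which is why the slope on its own says nothing about intervention.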

The three classical causal threats

1. Confounding. A third variable Z affects both X and Y. The regression coefficient on X picks up the part of Y that's explained by Z — not by X itself.

Example: a regression of income on coffee-drinking shows a positive coefficient. Conclude "coffee causes income"? No: education affects both, so the coffee coefficient absorbs income differences across education levels.

Fix: include the confounder Z in the regression. But you must KNOW or POSIT what Z is, and you must have measured it. An unmeasured confounder leaves the coefficient biased.
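A small simulation can make the confounding story concrete. This is a sketch under assumed numbers: Z drives both X and Y, X has no effect on Y at all, and the coefficients 0.8, 0.6 and 0.875 are chosen only so that the naive slope comes out near 0.7.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Assumed DAG: Z -> X and Z -> Y, with NO arrow from X to Y.
z = rng.normal(size=n)
x = 0.8 * z + 0.6 * rng.normal(size=n)   # Var(X) = 0.64 + 0.36 = 1
y = 0.875 * z + rng.normal(size=n)       # Y ignores X entirely


def ols(outcome, *regressors):
    """Return OLS coefficients (intercept first) of outcome on regressors."""
    design = np.column_stack([np.ones(len(outcome)), *regressors])
    return np.linalg.lstsq(design, outcome, rcond=None)[0]


naive = ols(y, x)[1]         # Z omitted: slope near 0.8 * 0.875 = 0.7
adjusted = ols(y, x, z)[1]   # Z included: slope near the true effect, 0

print(f"slope on X, Z omitted:  {naive:.3f}")
print(f"slope on X, Z included: {adjusted:.3f}")
```

Adjusting works here only because Z was recorded. Deleting `z` from the dataset would leave no way to recover the zero effect from `x` and `y` alone.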

2. Reverse causation. The causal arrow runs from Y to X (Y → X), not from X to Y.

Example: people who exercise more have lower body fat. But people with lower body fat also find exercise easier. A cross-sectional regression cannot distinguish the two directions.

Fix: temporal data (X precedes Y); experimental design (you control X); instrumental variables (Part 6).
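The symmetry behind this threat can be sketched in a few lines. In the assumed setup below, Y is generated first and X responds to Y, yet both regressions return a clearly non-zero slope. The coefficient 0.7 is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Assumed truth: Y -> X. Y is generated first; X responds to Y.
y = rng.normal(size=n)
x = 0.7 * y + np.sqrt(1 - 0.7**2) * rng.normal(size=n)   # Var(X) = 1


def slope(outcome, regressor):
    """Simple-regression OLS slope of outcome on regressor."""
    return np.cov(regressor, outcome)[0, 1] / np.var(regressor, ddof=1)


slope_y_on_x = slope(y, x)   # the "wrong-way" regression
slope_x_on_y = slope(x, y)   # the regression that matches the arrow
r = np.corrcoef(x, y)[0, 1]

print(f"Y on X: {slope_y_on_x:.3f}")
print(f"X on Y: {slope_x_on_y:.3f}")
print(f"product of slopes: {slope_y_on_x * slope_x_on_y:.3f}, r^2: {r**2:.3f}")
```

The product of the two slopes equals the squared correlation, so both fits carry the same information about association. Nothing in either output indicates which variable came first.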

3. Selection bias. The sample isn't representative of the population to which the conclusion would apply.

Example: a survey of exercise habits that recruits only gym-goers gives a biased estimate of population exercise, because respondents were selected ON exercise. Any relationship estimated in that sample describes gym-goers, not the population.

Fix: random sampling; matching; inverse probability weighting; sensitivity analysis (Part 6).
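Selection can also manufacture a relationship that does not exist in the population. The sketch below assumes X and Y are independent and uses a hypothetical selection rule (a unit enters the sample only when X + Y exceeds 1). Both the rule and the threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Assumed population: X and Y are independent, so the true slope is 0.
x = rng.normal(size=n)
y = rng.normal(size=n)

# Hypothetical selection rule: only units with X + Y > 1 are observed.
selected = (x + y) > 1.0


def slope(outcome, regressor):
    """Simple-regression OLS slope of outcome on regressor."""
    return np.cov(regressor, outcome)[0, 1] / np.var(regressor, ddof=1)


population_slope = slope(y, x)
sample_slope = slope(y[selected], x[selected])

print(f"units selected:       {selected.sum()} of {n}")
print(f"slope, population:    {population_slope:.3f}")
print(f"slope, selected only: {sample_slope:.3f}")
```

Within the selected units the slope is clearly negative even though X and Y are unrelated in the population: a unit with low X can only have been selected if its Y was high, and vice versa.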

The DAG framework (preview Part 6)

A directed acyclic graph (DAG) is a graphical representation of the causal data-generating process. Nodes are variables; arrows are direct causal effects. Given a DAG, the "backdoor criterion" identifies which variables must be adjusted for to recover a causal effect from observational data. Part 6 develops this rigorously.
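As a preview, and only under the assumption that a measured set of variables Z satisfies the backdoor criterion for the effect of X on Y, the adjustment can be written as a formula. The notation $do(X = x)$, meaning that X is set to x by intervention, is introduced here for illustration and is not used elsewhere in this section:

$$P(Y = y \mid do(X = x)) = \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z).$$

The right-hand side involves only observational quantities. Which variables belong in Z is something the DAG has to justify; the data alone cannot.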

Three practical recommendations

State the goal explicitly. "I want to predict Y from X" (prediction) is different from "I want to estimate the effect of X on Y" (causation). Communicate which.
Document the assumptions. If claiming causation, draw the DAG; specify what could confound; do sensitivity analysis.
Use the right tool. Pure prediction: any model that predicts well is fine. Causal effect estimation: randomised experiment, instrumental variables (IV), regression discontinuity (RDD), difference-in-differences (DiD), propensity scores (Part 6).

Try it

Default: the "True causal" scenario. The plot shows a scatter with an OLS line of slope ≈ 0.7. In this scenario the regression coefficient IS the causal effect, so the straightforward interpretation is correct.
Switch to "Confounded" scenario. The scatter looks similar; the OLS slope is still ≈ 0.7 (by construction). But the TRUE causal effect of X on Y is ZERO — the slope is entirely driven by the shared cause Z. A naive interpretation would falsely conclude X causes Y.
Switch to "Reverse causation" scenario. Again, similar scatter, slope ≈ 0.7. But the ARROW runs the other way: Y → X, not X → Y. A naive interpretation gets the causal direction backwards.
Resample (new seed) on each scenario. The slope wobbles slightly but the qualitative finding holds — same slope, different truth.
Increase N from 100 to 500. The slope tightens around its expected value in all three scenarios — bigger samples produce TIGHTER but EQUALLY MISLEADING coefficients when the causal structure is misidentified.
The key takeaway: regression is a powerful DESCRIPTIVE tool that summarises CONDITIONAL EXPECTATIONS. It is NOT a causal magic wand. Distinguishing the three scenarios above requires additional structure — RCT, IV, RDD, DiD, DAG-based identification (Part 6).
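The sample-size point can be checked offline with a rough stand-in for the "Confounded" scenario. The widget's actual data-generating process is not given here, so the coefficients below are assumptions chosen to give an expected slope of 0.7 with a true causal effect of zero.

```python
import numpy as np

rng = np.random.default_rng(4)


def confounded_slope(n):
    """OLS slope of Y on X when Z drives both and X has no effect on Y."""
    z = rng.normal(size=n)
    x = 0.8 * z + 0.6 * rng.normal(size=n)   # assumed coefficients
    y = 0.875 * z + rng.normal(size=n)       # true effect of X on Y is 0
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def summarise(n, reps=2_000):
    """Mean and standard deviation of the slope over repeated samples."""
    slopes = np.array([confounded_slope(n) for _ in range(reps)])
    return slopes.mean(), slopes.std(ddof=1)


mean_100, sd_100 = summarise(100)
mean_500, sd_500 = summarise(500)

print(f"N=100: mean slope {mean_100:.3f}, sd {sd_100:.3f}")
print(f"N=500: mean slope {mean_500:.3f}, sd {sd_500:.3f}")
```

The spread shrinks as N grows while the centre stays near 0.7, far from the true effect of zero. More data reduces sampling noise, not bias from a misidentified causal structure.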

A press release reports that "exercise reduces depression risk" based on a regression of depression score on exercise hours, adjusting for age and sex, in 50,000 adults. The coefficient is highly significant. What is the MOST LIKELY threat to the causal interpretation, and which method in Part 6 would you propose to address it?

What you now know

Regression coefficients are conditional-expectation slopes — statements about associations conditional on the other regressors. Causal claims require additional structure (RCT, DAG-based identification, instruments). The three classical threats — confounding, reverse causation, selection — affect every observational regression. Part 6 develops the causal-inference toolkit in detail; for now, this closes Part 4 with the right mental model: regression is a powerful descriptive tool, not a causal magic wand.

SDS Part 4 (Linear Regression) is complete. Part 5 turns to Generalised Linear Models for non-Normal outcomes (binary, count, etc.). Part 6 returns to the causal-inference toolkit promised here.

References

Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press. (The canonical DAG-based causal inference book.)
Imbens, G.W., Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press. (The potential-outcomes-framework canonical text.)
Hernán, M.A., Robins, J.M. (2020). Causal Inference: What If. Chapman & Hall. (Modern applied causal inference; free PDF widely available.)
Angrist, J.D., Pischke, J.S. (2009). Mostly Harmless Econometrics. Princeton University Press. (Practical applied causal econometrics with regression-as-tool framing.)
Greenland, S., Pearl, J., Robins, J.M. (1999). "Causal diagrams for epidemiologic research." Epidemiology 10(1), 37–48. (The DAG epidemiology classic.)