Confounding and the DAG toolkit
Learning objectives
- Represent causal assumptions as Directed Acyclic Graphs (DAGs)
- Identify the three canonical configurations: chain, fork, collider
- Distinguish a confounder, a mediator, and a collider — they ALL look similar in the data, but require opposite adjustment strategies
- Apply the back-door criterion to determine which variables to adjust for
- Recognise the danger of conditioning on colliders (Berkson's paradox)
- Use DAGs to make identification assumptions explicit and challengeable
When randomisation isn't available, you have OBSERVATIONAL data. The observed association is generally NOT the ATE; the main source of the gap is CONFOUNDING. DAGs (Directed Acyclic Graphs) are the modern formalism developed by Pearl (1995, 2009) for thinking about confounding precisely. They tell you WHICH variables to adjust for, and WHICH NOT TO. Get this wrong, especially by adjusting for a collider, and your "controlled" analysis can be MORE biased than the unadjusted one.
DAG basics
A DAG is a graph with VARIABLES as nodes and DIRECT CAUSAL EFFECTS as directed arrows. The arrows represent the direction in which the data-generating process moves: X → Y means "X is one of the things that causally determines Y" (together with Y's other parents and any noise terms). The graph records only THAT X affects Y, not how strongly; in a linear model the strength would be X's coefficient.
The "acyclic" part: no variable can be its own ancestor. You can't have X → Y → X. This rules out feedback loops within the same timestep.
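As a minimal sketch of these two ideas, a DAG can be stored as a mapping from each node to its direct causes, and acyclicity can be checked with a topological sort. The helper name `is_acyclic` and the two example graphs are illustrative assumptions, not part of the course widget.

```python
from graphlib import CycleError, TopologicalSorter

def is_acyclic(parents):
    """Return True if the graph has no directed cycle.

    `parents` maps each node to the set of its direct causes.
    """
    try:
        # Iterating static_order() raises CycleError if a cycle exists.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False

# Hypothetical example: Z -> X, Z -> Y, X -> Y (a valid DAG).
dag = {"Z": set(), "X": {"Z"}, "Y": {"X", "Z"}}
# Feedback loop X -> Y -> X, which a DAG forbids.
loop = {"X": {"Y"}, "Y": {"X"}}

print(is_acyclic(dag))
print(is_acyclic(loop))
```

A topological order exists exactly when no variable is its own ancestor, which is why the sort doubles as an acyclicity test.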
Three canonical node configurations
Every DAG is built from three local patterns:
- Chain: X → Z → Y. Information flows from X to Y through Z. The variable Z MEDIATES the effect of X on Y.
- Fork: X ← Z → Y. Z is a common CAUSE of X and Y. Z is a CONFOUNDER: it creates an open back-door path between X and Y that biases the X-Y association.
- Collider: X → Z ← Y. Z is a common EFFECT of X and Y. The path X → Z ← Y is BLOCKED by default. Conditioning on Z OPENS the path, creating a spurious association.
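The three patterns can be told apart by how the X-Y association changes once Z is held fixed. The sketch below simulates each pattern with linear Gaussian equations and compares the plain correlation of X and Y with their partial correlation given Z. The unit coefficients, sample size, and function names are assumptions chosen for illustration.

```python
import random
from statistics import fmean

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = fmean(a), fmean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def partial_corr(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

def simulate(pattern, n=20_000, seed=0):
    """Draw (xs, ys, zs) from a chain, fork, or collider (assumed unit effects)."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        if pattern == "chain":        # X -> Z -> Y
            x = rng.gauss(0, 1)
            z = x + rng.gauss(0, 1)
            y = z + rng.gauss(0, 1)
        elif pattern == "fork":       # X <- Z -> Y
            z = rng.gauss(0, 1)
            x = z + rng.gauss(0, 1)
            y = z + rng.gauss(0, 1)
        else:                         # collider: X -> Z <- Y
            x = rng.gauss(0, 1)
            y = rng.gauss(0, 1)
            z = x + y + rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs

for pattern in ("chain", "fork", "collider"):
    x, y, z = simulate(pattern)
    print(f"{pattern:9s} corr(X,Y)={corr(x, y):+.2f}  "
          f"corr(X,Y|Z)={partial_corr(x, y, z):+.2f}")
```

Under these assumed equations, the chain and the fork show a clearly positive X-Y correlation that drops to roughly zero given Z, while the collider shows roughly zero correlation that becomes clearly negative given Z.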
The back-door criterion
To identify the causal effect of T on Y in a DAG, find a set Z of variables such that:
- Z blocks ALL "back-door" paths from T to Y — paths that start with an arrow INTO T.
- Z contains NO descendants of T.
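If such a Z exists and is observable, the ATE is identified by adjustment. For a binary treatment T and a discrete Z:

$$\text{ATE} = \sum_{z} \big( E[Y \mid T = 1, Z = z] - E[Y \mid T = 0, Z = z] \big)\, P(Z = z)$$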
This is the formal justification for "include the confounder in the regression". It works only when Z really does block all back-door paths AND contains no descendants of T.
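For a binary treatment and a single binary confounder, adjustment amounts to averaging the within-stratum treated-minus-control difference over the distribution of Z. The sketch below does this on simulated data. The probabilities, the true ATE of 0.2, and all function names are assumptions made up for the example.

```python
import random

def simulate(n=50_000, seed=1):
    """Rows of (z, t, y) from the fork T <- Z -> Y plus T -> Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                      # confounder
        t = rng.random() < (0.8 if z else 0.2)      # Z raises P(T = 1)
        y = rng.random() < 0.1 + 0.2 * t + 0.5 * z  # assumed true ATE = 0.2
        rows.append((int(z), int(t), int(y)))
    return rows

def mean_y(rows, t, z=None):
    """Mean outcome among units with treatment t (and stratum z, if given)."""
    ys = [y for zi, ti, y in rows if ti == t and (z is None or zi == z)]
    return sum(ys) / len(ys)

def naive_difference(rows):
    """Unadjusted contrast E[Y | T=1] - E[Y | T=0]."""
    return mean_y(rows, 1) - mean_y(rows, 0)

def backdoor_ate(rows):
    """Stratum-specific contrasts weighted by P(Z = z)."""
    n = len(rows)
    ate = 0.0
    for z in (0, 1):
        p_z = sum(1 for zi, _, _ in rows if zi == z) / n
        ate += (mean_y(rows, 1, z) - mean_y(rows, 0, z)) * p_z
    return ate

rows = simulate()
print(f"naive difference: {naive_difference(rows):.3f}")
print(f"back-door adjusted: {backdoor_ate(rows):.3f}")
```

With these assumed parameters the naive contrast should land near 0.5, because treated units are mostly Z = 1, while the adjusted estimate should land near the true 0.2.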
Confounder vs mediator vs collider — opposite adjustment rules
The three configurations look almost identical in raw data: T, Y, and some third variable Z. But the correct handling of Z is OPPOSITE across them:
| Configuration | Role of Z | Adjust for Z? | Why |
|---|---|---|---|
| Fork | Confounder | YES | Blocks the back-door path T ← Z → Y; removes bias. |
| Chain | Mediator | Depends | Adjusting BLOCKS the indirect path; gives direct effect only. Don't adjust if you want the TOTAL effect. |
| Collider | Common effect | NO | Conditioning OPENS the path T → Z ← Y, creating spurious correlation (Berkson's paradox). |
Berkson's paradox: the collider trap
Berkson (1946) noticed that hospitalised patients can show a NEGATIVE correlation between two diseases A and B, even though A and B are independent in the population. Why? Because being hospitalised (Z) is a common effect of both diseases: either disease alone is bad enough to put you in hospital. Conditioning on Z = hospitalised selects a population in which a patient WITHOUT disease A needs some other reason for being admitted, and that reason is often disease B. The absence of one disease therefore makes the other more likely, creating a spurious negative correlation.
The general lesson: every "controlled for variables X" claim is a DAG claim. Adjusting for the wrong X variables can INTRODUCE bias rather than remove it.
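The hospital example can be reproduced with a few lines of simulation. The sketch below assumes two independent diseases with 10% prevalence each, guaranteed admission for either disease, and a 5% admission rate for unrelated reasons. All of these numbers are made up for illustration.

```python
import random

def simulate(n=100_000, seed=2):
    """Independent diseases A and B; admission is their common effect."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        a = rng.random() < 0.10   # disease A, assumed prevalence 10%
        b = rng.random() < 0.10   # disease B, drawn independently of A
        # Either disease leads to admission; 5% are admitted for other reasons.
        admitted = a or b or rng.random() < 0.05
        people.append((a, b, admitted))
    return people

def risk_difference(people):
    """P(B | A present) - P(B | A absent) within the given group."""
    b_with_a = [b for a, b, _ in people if a]
    b_without_a = [b for a, b, _ in people if not a]
    return sum(b_with_a) / len(b_with_a) - sum(b_without_a) / len(b_without_a)

population = simulate()
hospital = [person for person in population if person[2]]

print(f"whole population:  {risk_difference(population):+.3f}")
print(f"hospitalised only: {risk_difference(hospital):+.3f}")
```

In the whole population the difference should be close to zero, as independence implies. Restricting to hospitalised patients conditions on the collider and should produce a strongly negative difference.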
Worked example: smoking, lung cancer, and tar
Consider the DAG: Smoking → Tar deposit in lungs → Lung Cancer, with Smoking also having a direct effect on Lung Cancer through other mechanisms.
- To estimate the TOTAL causal effect of Smoking on Lung Cancer: DON'T adjust for Tar (it's a mediator).
- To estimate the DIRECT effect of Smoking on Lung Cancer (mechanisms other than tar): DO adjust for Tar.
- If there's an unobserved Z that causes both Smoking and Lung Cancer (e.g., Genetics): adjusting for Z would help, but if Z is unobserved you have residual confounding — document this and run sensitivity analysis (§6.8).
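The total-versus-direct distinction in the first two bullets can be checked on a linear toy version of this DAG. In the sketch below, smoking, tar, and cancer are continuous scores, and the coefficients (0.8 for smoking → tar, 0.5 for tar → cancer, 0.3 for the direct path) are assumptions invented for the example, so the implied total effect is 0.3 + 0.8 × 0.5 = 0.7.

```python
import random
from statistics import fmean

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = fmean(a), fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    return cov(x, y) / cov(x, x)

def slope_adjusted(x, z, y):
    """OLS coefficient on x when regressing y on x and z (with intercept)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    return (szz * cov(x, y) - sxz * cov(z, y)) / (sxx * szz - sxz ** 2)

def simulate(n=20_000, seed=3):
    """Smoking -> Tar -> Cancer plus a direct Smoking -> Cancer arrow."""
    rng = random.Random(seed)
    smoking, tar, cancer = [], [], []
    for _ in range(n):
        s = rng.gauss(0, 1)
        t = 0.8 * s + rng.gauss(0, 1)            # assumed effect on tar
        c = 0.3 * s + 0.5 * t + rng.gauss(0, 1)  # assumed direct + tar effects
        smoking.append(s)
        tar.append(t)
        cancer.append(c)
    return smoking, tar, cancer

smoking, tar, cancer = simulate()
print(f"unadjusted (total effect):  {slope(smoking, cancer):.2f}")
print(f"adjusted for tar (direct):  {slope_adjusted(smoking, tar, cancer):.2f}")
```

The unadjusted slope should come out near 0.7 and the tar-adjusted coefficient near 0.3. Neither is wrong; they answer different questions. The sketch has no unobserved common cause, so it says nothing about the residual-confounding case in the third bullet.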
Using DAGs in research practice
- Before any regression: DRAW the DAG. Be explicit about what causes what.
- For each candidate "control variable", classify it: confounder, mediator, collider, or descendant of T.
- Apply the back-door criterion to choose the adjustment set.
- Report the DAG with the paper — let reviewers challenge your causal assumptions.
- Sensitivity analysis (§6.8) bounds the damage if unobserved confounders exist.
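The classification and back-door steps in this checklist can be mechanised for small graphs. The sketch below is one possible brute-force checker: it enumerates every path from T to Y, keeps those that start with an arrow into T, and tests whether a candidate adjustment set blocks them all. The function names and both example DAGs are hypothetical, and the path enumeration is only practical for small hand-drawn graphs.

```python
def descendants(dag, node):
    """Nodes reachable from `node`; `dag` maps each node to its children."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def all_paths(dag, start, end):
    """All simple paths from start to end, ignoring arrow direction."""
    neighbours = {}
    for parent, children in dag.items():
        for child in children:
            neighbours.setdefault(parent, set()).add(child)
            neighbours.setdefault(child, set()).add(parent)

    def walk(path):
        if path[-1] == end:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])

def is_blocked(dag, path, adjust):
    """True if the adjustment set blocks this path."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = mid in dag.get(prev, ()) and mid in dag.get(nxt, ())
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is adjusted for.
            if mid not in adjust and not (descendants(dag, mid) & adjust):
                return True
        elif mid in adjust:
            # An adjusted chain or fork node blocks the path.
            return True
    return False

def satisfies_backdoor(dag, t, y, adjust):
    """Check both back-door conditions for the candidate set `adjust`."""
    adjust = set(adjust)
    if adjust & descendants(dag, t):
        return False  # contains a descendant of T
    backdoor = [p for p in all_paths(dag, t, y) if t in dag.get(p[1], ())]
    return all(is_blocked(dag, p, adjust) for p in backdoor)

# Hypothetical DAG 1: confounder Z, mediator M, post-treatment collider C.
dag1 = {"Z": {"T", "Y"}, "T": {"M", "C"}, "M": {"Y"}, "Y": {"C"}}
print(satisfies_backdoor(dag1, "T", "Y", set()))        # False: T <- Z -> Y open
print(satisfies_backdoor(dag1, "T", "Y", {"Z"}))        # True
print(satisfies_backdoor(dag1, "T", "Y", {"Z", "M"}))   # False: M descends from T

# Hypothetical DAG 2: C is a pre-treatment collider of unobserved U1 and U2.
dag2 = {"U1": {"T", "C"}, "U2": {"C", "Y"}, "T": {"Y"}}
print(satisfies_backdoor(dag2, "T", "Y", set()))        # True: C blocks by default
print(satisfies_backdoor(dag2, "T", "Y", {"C"}))        # False: adjusting opens it
```

The second graph illustrates why "pre-treatment" is not a sufficient reason to include a control: C is not a descendant of T, yet adjusting for it alone opens a path that was blocked.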
Try it
- Start with Confounder Z, true τ = 1.0. Note the unadjusted OLS estimate is heavily biased (~2.0-3.0) — confounding by Z. The adjusted OLS recovers ≈ 1.0. Adjustment is the cure.
- Set τ = 0 under the confounder DAG. The unadjusted estimate is still POSITIVE — pure confounding bias with NO true effect. Adjusting recovers ≈ 0.
- Switch to Mediator Z with τ = 1.0. Note: the true DIRECT effect (set in the simulator) is 0.5; the unadjusted estimate captures the TOTAL effect (direct + indirect via Z); the adjusted estimate captures the DIRECT only. Both are "correct" — they answer different questions.
- Switch to Collider Z. True effect of T on Y is 0 in this DAG. Unadjusted OLS correctly reports ~0. Adjusted OLS reports a SPURIOUS positive or negative estimate — Berkson's paradox in action. NEVER adjust for colliders.
- Hit "re-sample data" a few times under the collider DAG. The spurious adjusted estimate is consistent across seeds — it's structural bias, not noise.
Hospital admission is a common effect of both severity of injury (T) and presence of underlying disease (Y). A study of hospitalised patients shows injury severity is negatively correlated with disease prevalence. Is this evidence that severe injury PROTECTS against disease, or is it collider bias?
What you now know
DAGs are the language of modern causal inference. The back-door criterion tells you which variables to adjust for when you want the total effect: confounders YES, colliders NO, mediators NO. Adjust for a mediator only when the target is the DIRECT effect instead. The widget demonstrates each configuration with simulated data, so you can see the bias appear and disappear under different adjustment strategies. §6.4 turns to PROPENSITY-SCORE methods, which scale this back-door adjustment to high-dimensional X by collapsing it into a one-dimensional summary.
References
- Pearl, J. (1995). "Causal diagrams for empirical research." Biometrika 82(4), 669–688. (The foundational DAG paper for empirical work.)
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press. (The canonical book-length development of DAGs + do-calculus.)
- Greenland, S., Pearl, J., Robins, J.M. (1999). "Causal diagrams for epidemiologic research." Epidemiology 10(1), 37–48. (The epidemiology classic introducing DAGs to applied health research.)
- Hernán, M.A., Robins, J.M. (2020). Causal Inference: What If. Chapman & Hall. (Chapter 6 develops DAGs with extensive worked examples.)
- Cinelli, C., Forney, A., Pearl, J. (2024). "A crash course in good and bad controls." Sociological Methods & Research. (Modern catalogue of which variables to include/exclude in causal regressions.)