Confounding and the DAG toolkit

Part 6 — Causal inference for researchers

Learning objectives

Represent causal assumptions as Directed Acyclic Graphs (DAGs)
Identify the three canonical configurations: chain, fork, collider
Distinguish a confounder, a mediator, and a collider — they ALL look similar in the data, but require opposite adjustment strategies
Apply the back-door criterion to determine which variables to adjust for
Recognise the danger of conditioning on colliders (Berkson's paradox)
Use DAGs to make identification assumptions explicit and challengeable

When randomisation isn't available, you have OBSERVATIONAL data. The observed association between treatment and outcome is generally NOT the average treatment effect (ATE); the part of the gap that comes from common causes of treatment and outcome is called CONFOUNDING. DAGs (Directed Acyclic Graphs) are the modern formalism developed by Pearl (1995, 2009) for thinking about confounding precisely. They tell you WHICH variables to adjust for, and WHICH NOT TO. Get this wrong, especially by adjusting for a collider, and your "controlled" analysis can be MORE biased than the unadjusted one.

DAG basics

A DAG is a graph with VARIABLES as nodes and DIRECT CAUSAL EFFECTS as directed arrows. The arrows represent the direction in which the data-generating process moves: X → Y means "X is one of the things that causally determines Y" (along with Y's other parents and any noise terms). A missing arrow is also a claim: it asserts that there is NO direct effect between the two variables.

The "acyclic" part: no variable can be its own ancestor. You can't have X → Y → X. This rules out feedback loops within the same timestep.
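As a minimal sketch of these definitions, a DAG can be stored as a mapping from each variable to its direct causes, and the acyclicity requirement can be checked with the standard library's `graphlib` module (Python 3.9+). The variable names and the helper names `is_acyclic` and `causal_order` are illustrative assumptions, not part of the course material.

```python
from graphlib import CycleError, TopologicalSorter


def causal_order(parents):
    """Return the nodes in an order where every cause precedes its effects.

    `parents` maps each node to the set of its direct causes.
    Raises graphlib.CycleError if the graph contains a directed cycle.
    """
    return list(TopologicalSorter(parents).static_order())


def is_acyclic(parents):
    """True if no variable is its own ancestor."""
    try:
        causal_order(parents)
        return True
    except CycleError:
        return False


# Hypothetical example graph: Z -> T, Z -> Y, T -> Y
dag = {"Z": set(), "T": {"Z"}, "Y": {"T", "Z"}}

# A feedback loop X -> Y -> X is not a valid DAG.
cyclic = {"X": {"Y"}, "Y": {"X"}}

print(is_acyclic(dag), causal_order(dag))
print(is_acyclic(cyclic))
```

Storing parents rather than children is a design choice: it mirrors how a structural model is written, with each variable defined as a function of its direct causes.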

Three canonical node configurations

Every path in a DAG is built from three local patterns, each involving three variables:

Chain: X → Z → Y. Information flows from X to Y through Z. The variable Z MEDIATES the effect of X on Y.
Fork: X ← Z → Y. Z is a common CAUSE of X and Y. Z is a CONFOUNDER: it creates a back-door path between X and Y that is open by default and biases the X-Y association. Conditioning on Z BLOCKS the path.
Collider: X → Z ← Y. Z is a common EFFECT of X and Y. The path X → Z ← Y is BLOCKED by default. Conditioning on Z OPENS the path, creating a spurious association.
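The three patterns can be checked numerically. The sketch below simulates each one with linear Gaussian data and compares the raw X-Y correlation with the partial correlation after linearly removing Z. The unit coefficients, the sample size, and the helper names `simulate` and `partial_corr` are assumptions chosen purely for illustration.

```python
import numpy as np


def partial_corr(x, y, z):
    """Correlation of x and y after regressing z (plus intercept) out of both."""
    design = np.column_stack([np.ones_like(z), z])
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]


def simulate(kind, n=50_000, seed=0):
    """Draw (x, y, z) from a chain, fork, or collider with unit coefficients."""
    rng = np.random.default_rng(seed)
    e1, e2, e3 = rng.standard_normal((3, n))
    if kind == "chain":        # X -> Z -> Y
        x = e1
        z = x + e2
        y = z + e3
    elif kind == "fork":       # X <- Z -> Y
        z = e1
        x = z + e2
        y = z + e3
    elif kind == "collider":   # X -> Z <- Y
        x = e1
        y = e2
        z = x + y + e3
    else:
        raise ValueError(f"unknown configuration: {kind}")
    return x, y, z


results = {}
for kind in ("chain", "fork", "collider"):
    x, y, z = simulate(kind)
    results[kind] = (np.corrcoef(x, y)[0, 1], partial_corr(x, y, z))
    print(f"{kind:9s} raw = {results[kind][0]:+.3f}   given Z = {results[kind][1]:+.3f}")
```

With these particular coefficients, the chain and fork show a clear raw correlation that vanishes once Z is held fixed, while the collider shows no raw correlation and a negative one (about -0.5 in theory) once Z is held fixed.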

The back-door criterion

To identify the causal effect of T on Y in a DAG, find a set Z of variables such that:

Z blocks ALL "back-door" paths from T to Y — paths that start with an arrow INTO T.
Z contains NO descendants of T.

If such a Z exists and is observed, the mean potential outcome E[Y(t)] is identified by adjustment, and the ATE follows as the difference between treatment levels:

E[Y(t)] = E_Z\!\left[ E[Y \mid T = t, Z] \right].

This is the formal justification for "include the confounder in the regression". It works only when Z really does block all back-door paths AND contains no descendants of T.
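To see the adjustment formula at work, the following sketch simulates a single binary confounder Z, a binary treatment T, and a binary outcome Y, then compares the naive difference in means with the stratified (adjusted) estimate. All probabilities are hypothetical; they are chosen so that the true ATE is 0.2 and the naive contrast is about 0.44.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process: Z -> T, Z -> Y, T -> Y.
z = rng.random(n) < 0.5                       # confounder
t = rng.random(n) < 0.2 + 0.6 * z             # treatment is more likely when Z = 1
y = rng.random(n) < 0.1 + 0.2 * t + 0.4 * z   # true effect of T on P(Y = 1) is 0.2


def adjusted_mean(y, t, z, t_value):
    """Estimate E[Y(t)] as the sum over z of E[Y | T = t, Z = z] * P(Z = z)."""
    total = 0.0
    for z_value in (False, True):
        stratum = z == z_value
        total += y[stratum & (t == t_value)].mean() * stratum.mean()
    return total


naive = y[t].mean() - y[~t].mean()
adjusted = adjusted_mean(y, t, z, True) - adjusted_mean(y, t, z, False)

print(f"naive difference in means: {naive:.3f}")
print(f"back-door adjusted ATE:    {adjusted:.3f}")
```

The inner average is taken within each stratum of Z, and the outer average uses the marginal distribution of Z rather than its distribution among the treated. That reweighting is what removes the confounding in this example.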

Confounder vs mediator vs collider — opposite adjustment rules

The three configurations look almost identical in raw data: T, Y, and some third variable Z. But the correct adjustment decision depends entirely on Z's role, and the right answer for one role is the wrong answer for another:

Configuration	Role of Z	Adjust for Z?	Why
Fork	Confounder	YES	Blocks the back-door path T ← Z → Y; removes bias.
Chain	Mediator	Depends	Adjusting BLOCKS the indirect path; gives direct effect only. Don't adjust if you want the TOTAL effect.
Collider	Common effect	NO	Conditioning OPENS the path T → Z ← Y, creating spurious correlation (Berkson's paradox).

Berkson's paradox: the collider trap

Berkson (1946) showed that hospitalised patients can exhibit a NEGATIVE correlation between two diseases A and B, even though A and B are independent in the population. Why? Because being hospitalised (Z) is a common effect of both diseases: either disease alone is bad enough to put you in hospital. Conditioning on Z = hospitalised restricts the sample to people who have at least one reason to be there, so a patient WITHOUT disease A is more likely to have disease B (something must explain the admission). The result is a spurious negative correlation.
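A small simulation makes the selection effect concrete. In this sketch two diseases occur independently, each with 10% prevalence, and each one raises the probability of hospital admission. The prevalences and admission probabilities are hypothetical values chosen for illustration, not figures from Berkson's paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two independent diseases in the general population.
a = rng.random(n) < 0.10
b = rng.random(n) < 0.10

# Admission is a common effect: either disease raises the chance of being in hospital.
p_admit = 0.05 + 0.50 * a + 0.40 * b
hospitalised = rng.random(n) < p_admit

corr_population = np.corrcoef(a, b)[0, 1]
corr_hospital = np.corrcoef(a[hospitalised], b[hospitalised])[0, 1]

print(f"correlation of A and B, whole population: {corr_population:+.3f}")
print(f"correlation of A and B, hospitalised only: {corr_hospital:+.3f}")
```

Restricting the sample to hospitalised patients is a form of conditioning on the collider, even though no regression is run. The same bias appears whenever the analysed sample is selected on a common effect.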

The general lesson: every "controlled for variables X" claim is a DAG claim. Adjusting for the wrong X variables can INTRODUCE bias rather than remove it.

Worked example: smoking, lung cancer, and tar

Consider the DAG: Smoking → Tar deposit in lungs → Lung Cancer, with Smoking also having a direct effect on Lung Cancer through other mechanisms.

To estimate the TOTAL causal effect of Smoking on Lung Cancer: DON'T adjust for Tar (it's a mediator).
To estimate the DIRECT effect of Smoking on Lung Cancer (mechanisms other than tar): DO adjust for Tar.
If there's an unobserved Z that causes both Smoking and Lung Cancer (e.g., Genetics): adjusting for Z would help, but if Z is unobserved you have residual confounding — document this and run sensitivity analysis (§6.8).
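The total-versus-direct distinction can be sketched with a linear simulation of this DAG. The variables below are continuous stand-ins rather than real measurements, and the coefficients (0.8 for Smoking → Tar, 0.5 for Tar → Cancer, 0.3 for the direct Smoking → Cancer arrow) are hypothetical. Under these assumptions the total effect is 0.3 + 0.8 × 0.5 = 0.7.

```python
import numpy as np


def ols_slope(y, *regressors):
    """OLS coefficient on the first regressor, fitted with an intercept."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(design, y, rcond=None)[0][1]


rng = np.random.default_rng(2)
n = 100_000

# Hypothetical linear version of Smoking -> Tar -> Cancer plus a direct arrow.
smoking = rng.standard_normal(n)
tar = 0.8 * smoking + rng.standard_normal(n)
cancer = 0.3 * smoking + 0.5 * tar + rng.standard_normal(n)

total_effect = ols_slope(cancer, smoking)         # Tar left out of the regression
direct_effect = ols_slope(cancer, smoking, tar)   # Tar held fixed

print(f"total effect (no adjustment for Tar): {total_effect:.3f}")
print(f"direct effect (adjusted for Tar):     {direct_effect:.3f}")
```

This sketch assumes there is no unobserved common cause of Tar and Cancer. If one existed, adjusting for Tar would open a collider path through it and the "direct effect" estimate would itself be biased.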

Using DAGs in research practice

Before any regression: DRAW the DAG. Be explicit about what causes what.
For each candidate "control variable", classify it: confounder, mediator, collider, or descendant of T.
Apply the back-door criterion to choose the adjustment set.
Report the DAG with the paper — let reviewers challenge your causal assumptions.
Sensitivity analysis (§6.8) bounds the damage if unobserved confounders exist.
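The classification step can be partly mechanised once the DAG is written down. The sketch below labels each candidate control by its graph position relative to the treatment and outcome. It is a rough heuristic under stated assumptions, not a full implementation of the back-door criterion or d-separation, and the `Cough` node and the function names are hypothetical additions for illustration.

```python
def ancestors(parents, node, blocked=frozenset()):
    """All ancestors of `node`, never traversing through nodes in `blocked`."""
    seen = set()
    stack = list(parents[node])
    while stack:
        p = stack.pop()
        if p in seen or p in blocked:
            continue
        seen.add(p)
        stack.extend(parents[p])
    return seen


def classify_controls(parents, treatment, outcome):
    """Label every other node by its role relative to treatment and outcome."""
    anc_t = ancestors(parents, treatment)
    anc_y = ancestors(parents, outcome)
    # Causes of the outcome that reach it by a route not passing through treatment.
    anc_y_not_via_t = ancestors(parents, outcome, blocked={treatment})
    roles = {}
    for node in parents:
        if node in (treatment, outcome):
            continue
        node_ancestors = ancestors(parents, node)
        after_t = treatment in node_ancestors
        after_y = outcome in node_ancestors
        if after_t and node in anc_y:
            roles[node] = "mediator"
        elif after_t and after_y:
            roles[node] = "collider"
        elif after_t:
            roles[node] = "descendant of treatment"
        elif node in anc_t and node in anc_y_not_via_t:
            roles[node] = "confounder"
        else:
            roles[node] = "other"
    return roles


# The smoking example, with a hypothetical common effect "Cough" added.
# Every node must appear as a key, mapped to the set of its direct causes.
dag = {
    "Genetics": set(),
    "Smoking": {"Genetics"},
    "Tar": {"Smoking"},
    "Cancer": {"Smoking", "Tar", "Genetics"},
    "Cough": {"Smoking", "Cancer"},
}

roles = classify_controls(dag, treatment="Smoking", outcome="Cancer")
for name, role in roles.items():
    print(f"{name}: {role}")
```

A label of "other" is not a recommendation either way. Variables such as instruments or pure causes of the outcome need case-by-case judgement, and in larger graphs the adjustment set should be checked path by path against the back-door criterion.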

Try it

Start with Confounder Z, true τ = 1.0. Note the unadjusted OLS estimate is heavily biased (~2.0-3.0) — confounding by Z. The adjusted OLS recovers ≈ 1.0. Adjustment is the cure.
Set τ = 0 under the confounder DAG. The unadjusted estimate is still POSITIVE — pure confounding bias with NO true effect. Adjusting recovers ≈ 0.
Switch to Mediator Z with τ = 1.0. Note: the true DIRECT effect (set in the simulator) is 0.5; the unadjusted estimate captures the TOTAL effect (direct + indirect via Z); the adjusted estimate captures the DIRECT only. Both are "correct" — they answer different questions.
Switch to Collider Z. True effect of T on Y is 0 in this DAG. Unadjusted OLS correctly reports ~0. Adjusted OLS reports a SPURIOUS positive or negative estimate — Berkson's paradox in action. NEVER adjust for colliders.
Hit "re-sample data" a few times under the collider DAG. The spurious adjusted estimate is consistent across seeds — it's structural bias, not noise.

Hospital admission is a common effect of both severity of injury (T) and presence of underlying disease (Y). A study of hospitalised patients shows that injury severity is negatively correlated with disease prevalence. Is this evidence that severe injury PROTECTS against disease, or is it collider bias?

What you now know

DAGs are the language of modern causal inference. The back-door criterion tells you which variables to adjust for: confounders YES, colliders NO, mediators (depends on whether you want total or direct effect). The widget demonstrates each configuration with simulated data — see the bias appear and disappear under different adjustment strategies. §6.4 turns to PROPENSITY-SCORE methods, which scale this back-door adjustment to high-dimensional X by collapsing it into a one-dimensional summary.

References

Pearl, J. (1995). "Causal diagrams for empirical research." Biometrika 82(4), 669–688. (The foundational DAG paper for empirical work.)
Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press. (The canonical book-length development of DAGs + do-calculus.)
Greenland, S., Pearl, J., Robins, J.M. (1999). "Causal diagrams for epidemiologic research." Epidemiology 10(1), 37–48. (The epidemiology classic introducing DAGs to applied health research.)
Hernán, M.A., Robins, J.M. (2020). Causal Inference: What If. Chapman & Hall. (Chapter 6 develops DAGs with extensive worked examples.)
Cinelli, C., Forney, A., Pearl, J. (2024). "A crash course in good and bad controls." Sociological Methods & Research. (Modern catalogue of which variables to include/exclude in causal regressions.)