Confounding and the DAG toolkit

Part 6 — Causal inference for researchers

Learning objectives

  • Represent causal assumptions as Directed Acyclic Graphs (DAGs)
  • Identify the three canonical configurations: chain, fork, collider
  • Distinguish a confounder, a mediator, and a collider — they ALL look similar in the data, but require opposite adjustment strategies
  • Apply the back-door criterion to determine which variables to adjust for
  • Recognise the danger of conditioning on colliders (Berkson's paradox)
  • Use DAGs to make identification assumptions explicit and challengeable

When randomisation isn't available, you have OBSERVATIONAL data. The observed association generally does NOT equal the ATE; the part of the gap that comes from common causes of treatment and outcome is called CONFOUNDING. DAGs (Directed Acyclic Graphs) are the modern formalism, developed by Pearl (1995, 2009), for thinking about confounding precisely. They tell you WHICH variables to adjust for, and WHICH NOT TO. Get this wrong, especially by adjusting for a collider, and your "controlled" analysis can be more biased than the unadjusted one.

DAG basics

A DAG is a graph with VARIABLES as nodes and DIRECT CAUSAL EFFECTS as directed arrows. The arrows represent the direction in which the data-generating process moves: X → Y means "X is one of the things that causally determines Y" (along with Y's other parents and any noise terms).

The "acyclic" part: no variable can be its own ancestor. You can't have X → Y → X. This rules out feedback loops within the same timestep.
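As a minimal sketch of these definitions, a DAG can be stored as a mapping from each node to its set of parents, and the acyclicity requirement can be checked with the standard-library `graphlib` module. The variable names and the helper `is_acyclic` are illustrative assumptions, not part of the course material.

```python
from graphlib import TopologicalSorter, CycleError

def is_acyclic(parents):
    """Return True if the graph {node: set of parents} has no directed cycle."""
    try:
        # Iterating static_order() raises CycleError when a node is its own ancestor.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False

# Hypothetical three-variable DAG: Z -> X, Z -> Y, X -> Y.
dag = {"Z": set(), "X": {"Z"}, "Y": {"X", "Z"}}

# Adding Y -> X on top of X -> Y creates the forbidden loop X -> Y -> X.
cyclic = {"Z": set(), "X": {"Z", "Y"}, "Y": {"X", "Z"}}

print(is_acyclic(dag))
print(is_acyclic(cyclic))
```

The first graph passes the check and the second does not, because X and Y would each be an ancestor of the other.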

Three canonical node configurations

Every DAG is built from three local patterns:

  • Chain: X → Z → Y. Information flows from X to Y through Z. The variable Z MEDIATES the effect of X on Y.
  • Fork: X ← Z → Y. Z is a common CAUSE of X and Y. Z is a CONFOUNDER: it creates an open back-door path between X and Y that biases the X-Y association.
  • Collider: X → Z ← Y. Z is a common EFFECT of X and Y. The path X → Z ← Y is BLOCKED by default. Conditioning on Z OPENS the path, creating a spurious association.
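The three patterns can be illustrated with a small simulation. The sketch below assumes linear relationships with unit coefficients and standard normal noise (an arbitrary choice made only for illustration) and compares the X-Y correlation before and after conditioning on Z, where conditioning is approximated by regressing Z out of both variables.

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def partial_corr(a, b, z):
    """Correlation between a and b after linearly regressing z out of both."""
    design = np.column_stack([np.ones_like(z), z])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return corr(res_a, res_b)

rng = np.random.default_rng(0)  # arbitrary seed
n = 100_000

# Chain: X -> Z -> Y. Associated marginally; Z blocks the path when conditioned on.
x = rng.normal(size=n)
z = x + rng.normal(size=n)
y = z + rng.normal(size=n)
chain = (corr(x, y), partial_corr(x, y, z))

# Fork: X <- Z -> Y. Associated marginally; conditioning on Z removes the association.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)
fork = (corr(x, y), partial_corr(x, y, z))

# Collider: X -> Z <- Y. Independent marginally; conditioning on Z creates an association.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(size=n)
collider = (corr(x, y), partial_corr(x, y, z))

for name, (marginal, conditional) in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(f"{name:9s} marginal={marginal:+.2f}  given Z={conditional:+.2f}")
```

Under these assumed coefficients the theoretical values are roughly 0.58 then 0 for the chain, 0.50 then 0 for the fork, and 0 then -0.50 for the collider. The chain and the fork are indistinguishable from correlations alone, which is why the DAG, not the data, has to carry the causal assumptions.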

The back-door criterion

To identify the causal effect of T on Y in a DAG, find a set Z of variables such that:

  • Z blocks ALL "back-door" paths from T to Y — paths that start with an arrow INTO T.
  • Z contains NO descendants of T.

If such a Z exists and is observed, each mean potential outcome E[Y(t)], and therefore the ATE, is identified by adjustment:

$$E[Y(t)] = E_Z\left[\, E[Y \mid T = t, Z] \,\right].$$

This is the formal justification for "include the confounder in the regression". It works only when Z really does block all back-door paths AND contains no descendants of T.
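A minimal sketch of the adjustment formula on simulated data follows. It assumes a single binary confounder Z, a binary treatment T, and a made-up data-generating process with true effect 1.0; none of these numbers come from the text. The inner expectation is estimated by stratum means and the outer expectation by weighting each stratum by the marginal frequency of Z.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 200_000
tau = 1.0                       # assumed true effect of T on Y

# Fork Z -> T and Z -> Y, plus the causal arrow T -> Y.
z = rng.binomial(1, 0.5, size=n)
t = rng.binomial(1, 0.2 + 0.6 * z)           # Z makes treatment more likely
y = tau * t + 2.0 * z + rng.normal(size=n)   # Z also raises the outcome

# Unadjusted contrast: mixes the causal effect with the back-door path T <- Z -> Y.
naive = y[t == 1].mean() - y[t == 0].mean()

def mean_potential_outcome(t_value):
    """Estimate E_Z[ E[Y | T = t_value, Z] ] by averaging stratum means over P(Z)."""
    total = 0.0
    for z_value in (0, 1):
        stratum_mean = y[(t == t_value) & (z == z_value)].mean()
        total += stratum_mean * np.mean(z == z_value)
    return total

adjusted = mean_potential_outcome(1) - mean_potential_outcome(0)

print(f"naive difference    : {naive:.2f}")     # theoretical value under this DGP: 2.2
print(f"back-door adjusted  : {adjusted:.2f}")  # theoretical value under this DGP: 1.0
```

The weights are P(Z = z), not P(Z = z | T = t). Using the conditional weights would simply reproduce the naive contrast.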

Confounder vs mediator vs collider — opposite adjustment rules

The three configurations look almost identical in raw data: T, Y, and some third variable Z. But the correct handling of Z differs, and can be exactly opposite:

| Configuration | Role of Z | Adjust for Z? | Why |
| --- | --- | --- | --- |
| Fork | Confounder | YES | Blocks the back-door path T ← Z → Y; removes bias. |
| Chain | Mediator | Depends | Adjusting BLOCKS the indirect path; gives direct effect only. Don't adjust if you want the TOTAL effect. |
| Collider | Common effect | NO | Conditioning OPENS the path T → Z ← Y, creating spurious correlation (Berkson's paradox). |

Berkson's paradox: the collider trap

Berkson (1946) noticed that hospitalised patients can show a NEGATIVE correlation between two diseases A and B, even when A and B are independent in the population. Why? Because being hospitalised (Z) is a common effect of both diseases: either disease alone can be bad enough to put you in hospital. Conditioning on Z = hospitalised selects a sample in which a patient without A is more likely to have B (something had to explain the admission), creating a spurious negative correlation.
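The mechanism can be reproduced in a few lines. The sketch below assumes two independent diseases with 10% prevalence each and a hypothetical "noisy-OR" admission rule (each disease independently leads to admission with probability 0.5, plus a 2% baseline); these numbers are invented for illustration and are not Berkson's data.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n = 200_000

# Two diseases that are independent in the population.
a = rng.binomial(1, 0.1, size=n)
b = rng.binomial(1, 0.1, size=n)

# Hospitalisation is a common effect (collider) of A and B.
# Assumed noisy-OR rule: 2% baseline, each disease adds a 50% chance of admission.
p_admit = 1.0 - (1.0 - 0.02) * (1.0 - 0.5 * a) * (1.0 - 0.5 * b)
hospitalised = rng.binomial(1, p_admit).astype(bool)

corr_population = float(np.corrcoef(a, b)[0, 1])
corr_hospital = float(np.corrcoef(a[hospitalised], b[hospitalised])[0, 1])

print(f"correlation in the population : {corr_population:+.3f}")
print(f"correlation among hospitalised: {corr_hospital:+.3f}")
```

In the full sample the correlation is close to zero. Restricting to hospitalised patients, which is the same as conditioning on the collider, produces a clearly negative correlation even though neither disease affects the other.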

The general lesson: every claim of the form "we controlled for X" is implicitly a DAG claim. Adjusting for the wrong variables can INTRODUCE bias rather than remove it.

Worked example: smoking, lung cancer, and tar

Consider the DAG: Smoking → Tar deposit in lungs → Lung Cancer, with Smoking also having a direct effect on Lung Cancer through other mechanisms.

  • To estimate the TOTAL causal effect of Smoking on Lung Cancer: DON'T adjust for Tar (it's a mediator).
  • To estimate the DIRECT effect of Smoking on Lung Cancer (mechanisms other than tar): DO adjust for Tar.
  • If there's an unobserved Z that causes both Smoking and Lung Cancer (e.g., Genetics): adjusting for Z would help, but if Z is unobserved you have residual confounding — document this and run sensitivity analysis (§6.8).
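The total-versus-direct distinction in the first two bullets can be checked on a stylised linear version of this DAG. The coefficients below (0.8 for Smoking → Tar, 1.0 for Tar → Cancer, 0.5 for the direct path) are made up, all variables are treated as continuous scores, and no unobserved confounder is simulated; the helper `ols` is a hypothetical convenience wrapper around `numpy.linalg.lstsq`.

```python
import numpy as np

def ols(y, *regressors):
    """OLS slopes of y on the given regressors (an intercept is fitted, then dropped)."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]

rng = np.random.default_rng(3)  # arbitrary seed
n = 100_000

# Assumed structural equations for Smoking -> Tar -> Cancer plus a direct path.
smoking = rng.normal(size=n)
tar = 0.8 * smoking + rng.normal(size=n)
cancer = 0.5 * smoking + 1.0 * tar + rng.normal(size=n)

# Total effect: do NOT adjust for the mediator.
total_effect = float(ols(cancer, smoking)[0])

# Direct effect: adjust for the mediator, which blocks the indirect path.
direct_effect = float(ols(cancer, smoking, tar)[0])

print(f"total effect  (no adjustment for tar): {total_effect:.2f}")   # theory: 0.5 + 0.8 * 1.0 = 1.3
print(f"direct effect (adjusting for tar)    : {direct_effect:.2f}")  # theory: 0.5
```

Both numbers are valid answers to different questions. Note that the direct-effect regression is only trustworthy here because the simulation has no unobserved common cause of Tar and Cancer; with one, adjusting for the mediator would itself open a biasing path.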

Using DAGs in research practice

  • Before any regression: DRAW the DAG. Be explicit about what causes what.
  • For each candidate "control variable", classify it: confounder, mediator, collider, or descendant of T.
  • Apply the back-door criterion to choose the adjustment set.
  • Report the DAG with the paper — let reviewers challenge your causal assumptions.
  • Sensitivity analysis (§6.8) bounds the damage if unobserved confounders exist.
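For small DAGs, the classification and back-door steps above can be sketched in plain Python. The functions below are illustrative helpers written for this page (not a library API): they enumerate every path between T and Y, keep those that start with an arrow into T, and test whether a candidate adjustment set blocks each one. Path enumeration is exponential in the worst case, so this is only meant for hand-drawn graphs.

```python
def descendants(edges, node):
    """All nodes reachable from `node` by following arrows forward."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def simple_paths(edges, start, end):
    """All simple paths from start to end, ignoring arrow direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            paths.append(path)
            continue
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def is_blocked(edges, path, adjust):
    """True if the adjustment set blocks this path (d-separation rules)."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        if (prev, mid) in edge_set and (nxt, mid) in edge_set:
            # Collider: blocks unless it or one of its descendants is adjusted for.
            if mid not in adjust and not (descendants(edges, mid) & adjust):
                return True
        elif mid in adjust:
            # Chain or fork node: blocks once adjusted for.
            return True
    return False

def satisfies_backdoor(edges, t, y, adjust):
    """Check both back-door conditions for the candidate adjustment set."""
    adjust = set(adjust)
    if adjust & descendants(edges, t):
        return False  # contains a descendant of T
    edge_set = set(edges)
    backdoor = [p for p in simple_paths(edges, t, y) if (p[1], t) in edge_set]
    return all(is_blocked(edges, p, adjust) for p in backdoor)

# Hypothetical DAG: Z confounds, M mediates, C is a collider.
edges = [("Z", "T"), ("Z", "Y"), ("T", "M"), ("M", "Y"), ("T", "C"), ("Y", "C")]

for candidate in [set(), {"Z"}, {"Z", "M"}, {"Z", "C"}]:
    print(sorted(candidate), satisfies_backdoor(edges, "T", "Y", candidate))
```

In this example only {Z} passes: the empty set leaves T ← Z → Y open, while adding M or C violates the no-descendants condition.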

[Interactive figure: DAG Confounder Explorer]

Try it

  • Start with Confounder Z, true τ = 1.0. Note the unadjusted OLS estimate is heavily biased (~2.0-3.0) — confounding by Z. The adjusted OLS recovers ≈ 1.0. Adjustment is the cure.
  • Set τ = 0 under the confounder DAG. The unadjusted estimate is still POSITIVE — pure confounding bias with NO true effect. Adjusting recovers ≈ 0.
  • Switch to Mediator Z with τ = 1.0. Note: the true DIRECT effect (set in the simulator) is 0.5; the unadjusted estimate captures the TOTAL effect (direct + indirect via Z); the adjusted estimate captures the DIRECT only. Both are "correct" — they answer different questions.
  • Switch to Collider Z. True effect of T on Y is 0 in this DAG. Unadjusted OLS correctly reports ~0. Adjusted OLS reports a SPURIOUS positive or negative estimate — Berkson's paradox in action. NEVER adjust for colliders.
  • Hit "re-sample data" a few times under the collider DAG. The spurious adjusted estimate is consistent across seeds — it's structural bias, not noise.

Hospital admission is a common effect of both severity of injury (T) and presence of underlying disease (Y). A study of hospitalised patients shows that injury severity is negatively correlated with disease prevalence. Is this evidence that severe injury PROTECTS against disease, or is it collider bias?

What you now know

DAGs are the language of modern causal inference. The back-door criterion tells you which variables to adjust for: confounders YES, colliders NO, mediators DEPENDS (on whether you want the total or the direct effect). The widget demonstrates each configuration with simulated data, so you can see the bias appear and disappear under different adjustment strategies. §6.4 turns to PROPENSITY-SCORE methods, which scale this back-door adjustment to high-dimensional X by collapsing it into a one-dimensional summary.

References

  • Pearl, J. (1995). "Causal diagrams for empirical research." Biometrika 82(4), 669–688. (The foundational DAG paper for empirical work.)
  • Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press. (The canonical book-length development of DAGs + do-calculus.)
  • Greenland, S., Pearl, J., Robins, J.M. (1999). "Causal diagrams for epidemiologic research." Epidemiology 10(1), 37–48. (The epidemiology classic introducing DAGs to applied health research.)
  • Hernán, M.A., Robins, J.M. (2020). Causal Inference: What If. Chapman & Hall. (Chapter 6 develops DAGs with extensive worked examples.)
  • Cinelli, C., Forney, A., Pearl, J. (2024). "A crash course in good and bad controls." Sociological Methods & Research. (Modern catalogue of which variables to include/exclude in causal regressions.)
