Propensity scores, matching, and IPTW

Part 6 — Causal inference for researchers

Learning objectives

  • Define the propensity score e(X) = P(T = 1 | X)
  • Apply propensity-score matching to construct comparable treated/control groups
  • Apply inverse-probability-of-treatment weighting (IPTW) to reweight units back to the marginal population
  • Diagnose OVERLAP / POSITIVITY violations
  • Recognise the crucial limitation: propensity-score methods adjust for OBSERVED confounders only

The back-door criterion (§6.3) prescribes adjustment for some set Z of observed confounders. With high-dimensional X (tens or hundreds of variables), direct regression adjustment becomes unwieldy and model-dependent. PROPENSITY-SCORE methods (Rosenbaum & Rubin 1983) offer a dimension-reduction shortcut: collapse all of X into a one-dimensional summary e(X) = P(T = 1 | X), then adjust on this scalar score.

The propensity score

The propensity score is the conditional probability of treatment given covariates:

$$e(\mathbf{x}) = P(T = 1 \mid \mathbf{X} = \mathbf{x}).$$

In practice e(X) is unknown and has to be estimated, typically via logistic regression of T on X or a richer model (random forest, gradient boosting, generalised additive models).
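
As a minimal sketch of the estimation step, the following fits a one-covariate logistic propensity model by Newton-Raphson using only the Python standard library. The data-generating process (X ~ N(0, 1), P(T = 1 | X) = sigmoid(alpha · X)) and the helper names `simulate` and `fit_logistic` are assumptions made for illustration; they are not taken from this page or from any particular library.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate(n, alpha, seed=0):
    """Assumed toy DGP: X ~ N(0, 1), P(T = 1 | X) = sigmoid(alpha * X)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ts = [1 if rng.random() < sigmoid(alpha * x) else 0 for x in xs]
    return xs, ts


def fit_logistic(xs, ts, iters=25):
    """Fit logit P(T = 1 | x) = b0 + b1 * x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, t in zip(xs, ts):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += t - p          # score for the intercept
            g1 += (t - p) * x    # score for the slope
            h00 += w             # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


xs, ts = simulate(n=5000, alpha=0.6, seed=1)
b0, b1 = fit_logistic(xs, ts)
e_hat = [sigmoid(b0 + b1 * x) for x in xs]  # estimated propensity scores
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"e_hat range: [{min(e_hat):.3f}, {max(e_hat):.3f}]")
```

With many covariates one would normally reach for an established statistics package rather than hand-rolled Newton steps; the point here is only that the fitted values `e_hat` are the scores that matching and weighting then use.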

The Rosenbaum-Rubin (1983) theorem

Their foundational result: under conditional ignorability — T independent of (Y(0), Y(1)) given X — we have

$$T \perp (Y(0), Y(1)) \mid e(X).$$

That is, treatment is unconfounded GIVEN THE PROPENSITY SCORE ALONE. The high-dimensional X collapses into a scalar. Under ignorability, and provided overlap holds (0 < e(X) < 1), adjusting on e(X) is sufficient for identification.
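
A sketch of the standard argument, in this section's notation, shows why the scalar is enough. The first step uses the tower property (e(X) is a function of X), the second uses conditional ignorability, and the third is the definition of e(X):

```latex
\begin{aligned}
P\bigl(T = 1 \mid Y(0), Y(1), e(X)\bigr)
  &= E\bigl[\, P(T = 1 \mid Y(0), Y(1), X) \,\bigm|\, Y(0), Y(1), e(X) \bigr] \\
  &= E\bigl[\, P(T = 1 \mid X) \,\bigm|\, Y(0), Y(1), e(X) \bigr] \\
  &= E\bigl[\, e(X) \,\bigm|\, Y(0), Y(1), e(X) \bigr] = e(X).
\end{aligned}
```

The right-hand side does not depend on (Y(0), Y(1)), so the treatment probability given the score is the same whatever the potential outcomes are, which is the conditional independence stated above.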

Propensity-score matching

For each treated unit i, find the control unit j whose e(X) is closest: the "match". Compare outcomes within each pair. Averaging the within-pair differences estimates the ATT (the average treatment effect on the treated):

$$\hat{\tau}_{\text{ATT}} = \frac{1}{N_T} \sum_{i: T_i = 1} (Y_i - Y_{m(i)}),$$

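where m(i) is the control unit matched to treated unit i and N_T is the number of treated units. Variants: nearest-neighbour, k-nearest-neighbours, optimal matching, full matching. Caliper-based matching (dropping treated units that have no control within a maximum propensity distance) is widely used to enforce overlap; N_T then counts only the treated units that found a match.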
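
The following is a minimal sketch of 1-nearest-neighbour matching with replacement and a caliper on the logit of the propensity score. It makes several assumptions for illustration: a toy data-generating process (X ~ N(0, 1), P(T = 1 | X) = sigmoid(alpha · X), Y = tau · T + X + noise), a known true propensity in place of an estimated one, and hypothetical helper names (`simulate`, `match_att`, `naive_diff`). The average is taken over matched treated units only.

```python
import bisect
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n, alpha, tau, seed=0):
    """Assumed toy DGP; each row is (t, y, e) with e the true propensity."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = sigmoid(alpha * x)
        t = 1 if rng.random() < e else 0
        y = tau * t + x + rng.gauss(0.0, 1.0)
        rows.append((t, y, e))
    return rows


def naive_diff(rows):
    """Unadjusted difference in mean outcomes, treated minus control."""
    y1 = [y for t, y, _ in rows if t == 1]
    y0 = [y for t, y, _ in rows if t == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def match_att(rows, caliper=0.1):
    """1-NN matching with replacement on logit(e); returns (att, n_matched, n_treated)."""
    def logit(e):
        return math.log(e / (1.0 - e))

    controls = sorted((logit(e), y) for t, y, e in rows if t == 0)
    keys = [k for k, _ in controls]
    diffs = []
    n_treated = 0
    for t, y, e in rows:
        if t != 1:
            continue
        n_treated += 1
        target = logit(e)
        pos = bisect.bisect_left(keys, target)
        best = None
        # The nearest control is one of the two neighbours in sorted order.
        for j in (pos - 1, pos):
            if 0 <= j < len(keys):
                dist = abs(keys[j] - target)
                if best is None or dist < best[0]:
                    best = (dist, controls[j][1])
        # Treated units with no control inside the caliper are dropped.
        if best is not None and best[0] <= caliper:
            diffs.append(y - best[1])
    att = sum(diffs) / len(diffs) if diffs else float("nan")
    return att, len(diffs), n_treated


rows = simulate(n=4000, alpha=0.6, tau=2.0, seed=3)
att, n_matched, n_treated = match_att(rows, caliper=0.1)
print(f"naive diff  = {naive_diff(rows):.3f}")
print(f"matched ATT = {att:.3f} ({n_matched} of {n_treated} treated matched)")
```

Reporting `n_matched` alongside the estimate matters: when the caliper drops treated units, the estimate describes the matched treated units, not all treated units.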

Inverse Probability of Treatment Weighting (IPTW)

Reweight each unit by the inverse of its propensity:

$$\hat{\tau}_{\text{IPTW}} = \frac{1}{N} \sum_i \left[ \frac{T_i Y_i}{e(X_i)} - \frac{(1 - T_i) Y_i}{1 - e(X_i)} \right].$$

Treated units with small e(X) (rarely treated at that X) get LARGE weight. Control units with large e(X) (usually treated at that X) get LARGE weight. In the reweighted pseudo-population, the treated and control groups each have the covariate distribution of the full sample, as if treatment had been assigned at random.
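
The IPTW estimator can be sketched in a few lines. The code below assumes a toy data-generating process (X ~ N(0, 1), P(T = 1 | X) = sigmoid(alpha · X), Y = tau · T + X + noise) and a known true propensity; the helper names are hypothetical. It also includes a normalised (Hájek) variant, which is not discussed in the text above and is shown only for comparison.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n, alpha, tau, seed=0):
    """Assumed toy DGP; each row is (t, y, e) with e the true propensity."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = sigmoid(alpha * x)
        t = 1 if rng.random() < e else 0
        y = tau * t + x + rng.gauss(0.0, 1.0)
        rows.append((t, y, e))
    return rows


def naive_diff(rows):
    """Unadjusted difference in mean outcomes, treated minus control."""
    y1 = [y for t, y, _ in rows if t == 1]
    y0 = [y for t, y, _ in rows if t == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def iptw_ate(rows):
    """IPTW estimator exactly as in the displayed formula (divide by N)."""
    total = sum(t * y / e - (1 - t) * y / (1.0 - e) for t, y, e in rows)
    return total / len(rows)


def hajek_ate(rows):
    """Normalised variant: divide each arm by the sum of its weights."""
    num1 = sum(t * y / e for t, y, e in rows)
    den1 = sum(t / e for t, y, e in rows)
    num0 = sum((1 - t) * y / (1.0 - e) for t, y, e in rows)
    den0 = sum((1 - t) / (1.0 - e) for t, y, e in rows)
    return num1 / den1 - num0 / den0


for tau in (2.0, 0.0):
    rows = simulate(n=20000, alpha=0.6, tau=tau, seed=7)
    print(f"tau = {tau}: naive = {naive_diff(rows):.3f}, "
          f"IPTW = {iptw_ate(rows):.3f}, Hajek = {hajek_ate(rows):.3f}")
```

Under this assumed process the naive difference absorbs the confounding through X, while both weighted estimates should land close to tau.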

The overlap / positivity assumption

For propensity-score methods to work, every X region must have BOTH treated and control units. Formally:

$$0 < e(\mathbf{x}) < 1 \quad \text{for all } \mathbf{x}.$$

If e(X) is near 0 or 1 for some units, the IPTW weights explode (1/0.01 = 100), making the estimator unstable. Practical responses:

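  • TRIM: exclude units with e(X) outside [0.05, 0.95]. This loses information and narrows the target population to the region of overlap, but stabilises the estimator.
  • STABILISED weights: w_i = T_i·P(T = 1)/e(X_i) + (1 − T_i)·(1 − P(T = 1))/(1 − e(X_i)), i.e. the raw inverse-probability weight multiplied by the marginal probability of the treatment actually received. This scales the weights to have mean approximately 1 and reduces their variance.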
  • Diagnose first: look at the distribution of e(X) by T. If treated and control distributions barely overlap, your sample doesn't have a comparable counterfactual.
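
The three responses above can be sketched as small helper functions. Everything here is illustrative: the toy data-generating process (X ~ N(0, 1), T ~ Bernoulli(sigmoid(alpha · X))), the use of the true propensity, and the function names are assumptions. The stabilised weights use the common form P(T = t) / P(T = t | X), with P(T = 1) estimated by the sample treated fraction.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n, alpha, seed=0):
    """Assumed toy DGP: X ~ N(0, 1), T ~ Bernoulli(sigmoid(alpha * X))."""
    rng = random.Random(seed)
    ts, es = [], []
    for _ in range(n):
        e = sigmoid(alpha * rng.gauss(0.0, 1.0))
        ts.append(1 if rng.random() < e else 0)
        es.append(e)
    return ts, es


def overlap_range(ts, es):
    """(min, max) of the propensity score within each treatment arm."""
    treated = [e for t, e in zip(ts, es) if t == 1]
    control = [e for t, e in zip(ts, es) if t == 0]
    return (min(treated), max(treated)), (min(control), max(control))


def raw_weights(ts, es):
    """Inverse probability of the treatment actually received."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e) for t, e in zip(ts, es)]


def stabilised_weights(ts, es):
    """Raw weight times the marginal probability of the received treatment."""
    p_treat = sum(ts) / len(ts)
    return [p_treat / e if t == 1 else (1.0 - p_treat) / (1.0 - e)
            for t, e in zip(ts, es)]


def trim(es, lo=0.05, hi=0.95):
    """Indices of units whose propensity lies inside [lo, hi]."""
    return [i for i, e in enumerate(es) if lo <= e <= hi]


for alpha in (0.6, 2.0):
    ts, es = simulate(n=20000, alpha=alpha, seed=5)
    (t_lo, t_hi), (c_lo, c_hi) = overlap_range(ts, es)
    sw = stabilised_weights(ts, es)
    print(f"alpha = {alpha}")
    print(f"  e(X) range, treated: [{t_lo:.3f}, {t_hi:.3f}]")
    print(f"  e(X) range, control: [{c_lo:.3f}, {c_hi:.3f}]")
    print(f"  max raw weight = {max(raw_weights(ts, es)):.1f}, "
          f"max stabilised = {max(sw):.1f}, mean stabilised = {sum(sw) / len(sw):.3f}")
    print(f"  kept after trimming: {len(trim(es))} of {len(es)}")
```

Min/max ranges are a crude summary; a histogram of e(X) by treatment arm shows the same thing in more detail. Note also that stabilisation rescales the weights but does not remove the extreme ones, so it complements trimming rather than replacing it.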

The crucial caveat: only observed confounders

Propensity-score methods CANNOT recover the true ATE if there's an unmeasured confounder U. The propensity model e(X) is built from observed X only; U doesn't appear; conditioning on e(X) doesn't balance U; residual confounding remains. Sensitivity analysis (§6.8) quantifies how strong an unobserved confounder would have to be to overturn conclusions.
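
A small simulation makes the caveat concrete. The sketch below assumes a hypothetical data-generating process with an observed X ~ N(0, 1) and an unobserved binary U ~ Bernoulli(0.5), where P(T = 1 | X, U) = sigmoid(alpha · X + gamma · (U − 0.5)) and Y = tau · T + X + beta · U + noise. Because U is binary and independent of X, the observed-only propensity P(T = 1 | X) has a closed form, so the comparison does not depend on model-fitting error. All names and parameter values are illustrative.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n, alpha, gamma, beta, tau, seed=0):
    """Assumed toy DGP; each row is (t, y, e_obs, e_full)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u = 1 if rng.random() < 0.5 else 0  # unmeasured confounder
        # Propensity given everything that drives treatment (X and U).
        e_full = sigmoid(alpha * x + gamma * (u - 0.5))
        # Propensity given X only: average e_full over the two values of U.
        e_obs = 0.5 * sigmoid(alpha * x + 0.5 * gamma) \
            + 0.5 * sigmoid(alpha * x - 0.5 * gamma)
        t = 1 if rng.random() < e_full else 0
        y = tau * t + x + beta * u + rng.gauss(0.0, 1.0)
        rows.append((t, y, e_obs, e_full))
    return rows


def iptw_ate(rows, column):
    """IPTW estimate using the propensity stored at the given row index."""
    total = 0.0
    for row in rows:
        t, y, e = row[0], row[1], row[column]
        total += t * y / e - (1 - t) * y / (1.0 - e)
    return total / len(rows)


rows = simulate(n=20000, alpha=0.6, gamma=2.0, beta=2.0, tau=2.0, seed=11)
print(f"true ATE                    = 2.000")
print(f"IPTW with e(X)   (U hidden) = {iptw_ate(rows, 2):.3f}")
print(f"IPTW with e(X,U) (oracle)   = {iptw_ate(rows, 3):.3f}")
```

Even with the exactly correct e(X), the estimate that ignores U stays away from the true effect, and collecting more data does not help; only the oracle weights, which are unavailable in practice, remove the bias.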

The honest framing: propensity-score methods are excellent for what they do — efficient observed-confounder adjustment — but they are NOT a substitute for randomisation. If you wrote down a DAG (§6.3) with all your variables AND your data-generating process really matches that DAG, propensity-score adjustment is valid. The strength of the conclusion is bounded by the strength of those assumptions.

[Interactive figure: Propensity Overlap]

Try it

  • Start with α = 0.6 (moderate selection), τ = 2.0. Observe the treated (green, upper) and control (red, lower) X distributions diverge — treated has more high-X, control has more low-X.
  • Naive diff-in-means is biased because the comparison isn't apples-to-apples. IPTW with correct propensity recovers ATE ≈ 2.0.
  • Crank α to 2.0 (extreme selection). Now treated is almost entirely high-X; control almost entirely low-X. The propensity range shows minimum near 0 or maximum near 1 — POSITIVITY VIOLATED. IPTW weights explode; the estimator becomes unstable.
  • Set τ = 0. Naive diff is far from zero (pure confounding by X). IPTW recovers ≈ 0.
  • Hit re-sample under high α. Note: the bias from positivity violations doesn't average out across seeds; it's a structural problem with the data.

A study uses propensity-score matching with caliper width 0.1 on the logit scale. Of 1000 treated units, only 700 find a match within the caliper. Is dropping the 300 unmatched units (a) harmless because their treatment effect is undefined, or (b) selection-inducing bias requiring documentation?

What you now know

The propensity score collapses high-dimensional confounder adjustment into a one-dimensional scalar. Matching and IPTW are the two primary adjustment strategies. Positivity (every X has BOTH treated and control) is the make-or-break condition; without it, the methods are unreliable. And crucially: propensity-score methods adjust for OBSERVED confounders only — they require the no-unmeasured-confounders assumption. §6.5 turns to INSTRUMENTAL VARIABLES, the canonical observational tool when unmeasured confounding is suspected.

References

  • Rosenbaum, P.R., Rubin, D.B. (1983). "The central role of the propensity score in observational studies for causal effects." Biometrika 70(1), 41–55. (The foundational paper.)
  • Imbens, G.W., Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press. (Chapters 12–14 develop the propensity-score machinery.)
  • Abadie, A., Imbens, G.W. (2006). "Large sample properties of matching estimators for average treatment effects." Econometrica 74(1), 235–267.
  • Stuart, E.A. (2010). "Matching methods for causal inference: A review and a look forward." Statistical Science 25(1), 1–21. (Comprehensive applied review.)
  • Hirano, K., Imbens, G.W., Ridder, G. (2003). "Efficient estimation of average treatment effects using the estimated propensity score." Econometrica 71(4), 1161–1189.
