Propensity scores, matching, and IPTW

Part 6 — Causal inference for researchers

Learning objectives

Define the propensity score e(X) = P(T = 1 | X)
Apply propensity-score matching to construct comparable treated/control groups
Apply inverse-probability-of-treatment weighting (IPTW) to reweight units back to the marginal population
Diagnose OVERLAP / POSITIVITY violations
Recognise the crucial limitation: propensity-score methods adjust for OBSERVED confounders only

The back-door criterion (§6.3) prescribes adjustment for some set Z of observed confounders. With high-dimensional X (tens or hundreds of variables), direct regression adjustment becomes unwieldy and model-dependent. PROPENSITY-SCORE methods (Rosenbaum & Rubin 1983) offer a dimension-reduction shortcut: collapse all of X into a one-dimensional summary e(X) = P(T = 1 | X), then adjust on this scalar score.

The propensity score

The propensity score is the conditional probability of treatment given covariates:

e(\mathbf{x}) = P(T = 1 \mid \mathbf{X} = \mathbf{x}).

In practice e(X) is unknown and must be estimated, typically by logistic regression of T on X, or by richer models such as random forests, gradient boosting, or generalised additive models.
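As a minimal sketch of the estimation step, the snippet below fits a one-covariate logistic regression by Newton-Raphson using only the Python standard library. The simulated data (a single standard-normal covariate with true propensity sigmoid(0.6·x)) and the helper names `sigmoid` and `fit_logistic_1d` are assumptions made for illustration; a real analysis would normally use a statistics library and many covariates.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic_1d(xs, ts, iters=25):
    """Newton-Raphson fit of logit P(T=1 | x) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # negative Hessian (information matrix)
        for x, t in zip(xs, ts):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += t - p
            g1 += (t - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (negative Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: one covariate, true propensity sigmoid(0.6 * x).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ts = [1 if rng.random() < sigmoid(0.6 * x) else 0 for x in xs]

b0, b1 = fit_logistic_1d(xs, ts)
e_hat = [sigmoid(b0 + b1 * x) for x in xs]  # estimated propensity scores
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

The fitted slope should land close to the value used to generate the data, and `e_hat` is the per-unit score that matching or weighting would then consume.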

The Rosenbaum-Rubin (1983) theorem

Their foundational result: under conditional ignorability — T independent of (Y(0), Y(1)) given X — we have

T \perp (Y(0), Y(1)) \mid e(X).

That is, treatment is unconfounded GIVEN THE PROPENSITY SCORE ALONE. The high-dimensional X collapses into a scalar: if adjusting for X identifies the effect, then so does adjusting for e(X), provided the overlap condition discussed below also holds.
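A short sketch of why the result holds, writing Y(\cdot) as shorthand for the pair (Y(0), Y(1)) (notation introduced here for brevity). Because e(X) is a function of X, the tower property followed by conditional ignorability gives

P(T = 1 \mid Y(\cdot), e(X)) = E\big[ P(T = 1 \mid Y(\cdot), X) \mid Y(\cdot), e(X) \big] = E\big[ e(X) \mid Y(\cdot), e(X) \big] = e(X).

The same steps without Y(\cdot) give P(T = 1 \mid e(X)) = e(X). The two conditional probabilities coincide, so once e(X) is known the potential outcomes carry no further information about treatment assignment.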

Propensity-score matching

For each treated unit i, find the control unit j with the closest e(X) — the "match". Compare each pair's outcomes. The matched-pair analysis estimates the ATT (effect on the treated):

\hat{\tau}_{\text{ATT}} = \frac{1}{N_T} \sum_{i: T_i = 1} (Y_i - Y_{m(i)}),

where m(i) is the control unit matched to treated unit i. Variants: nearest-neighbour, k-nearest-neighbours, optimal matching, full matching. Modern best practice adds a caliper (treated units with no control within a maximum propensity-score distance are dropped) to enforce overlap.
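The matching estimator can be sketched as follows: one-to-one nearest-neighbour matching with replacement on the propensity score, with a caliper. The data-generating process (one covariate, selection strength 0.6, constant effect 2.0), the caliper of 0.05 on the probability scale, and the function name `match_att` are illustrative assumptions. For brevity the sketch matches on the true propensity rather than an estimated one.

```python
import bisect
import math
import random
from statistics import mean

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical data: X confounds both treatment and outcome.
rng = random.Random(1)
alpha, tau, n = 0.6, 2.0, 4000
data = []  # tuples of (propensity, treatment, outcome)
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    e = sigmoid(alpha * x)
    t = 1 if rng.random() < e else 0
    y = tau * t + x + rng.gauss(0.0, 1.0)
    data.append((e, t, y))

def match_att(data, caliper):
    """Nearest-neighbour matching (with replacement) on the score."""
    controls = sorted((e, y) for e, t, y in data if t == 0)
    c_scores = [e for e, _ in controls]
    diffs = []
    for e, t, y in data:
        if t == 0:
            continue
        # The closest control is one of the two neighbours of the
        # insertion point in the sorted control scores.
        k = bisect.bisect_left(c_scores, e)
        candidates = [j for j in (k - 1, k) if 0 <= j < len(controls)]
        j = min(candidates, key=lambda c: abs(c_scores[c] - e))
        if abs(c_scores[j] - e) <= caliper:  # drop if outside caliper
            diffs.append(y - controls[j][1])
    return mean(diffs), len(diffs)

n_treated = sum(t for _, t, _ in data)
naive = mean(y for _, t, y in data if t == 1) - mean(y for _, t, y in data if t == 0)
att, n_matched = match_att(data, caliper=0.05)
print(f"naive = {naive:.2f}, matched ATT = {att:.2f}, "
      f"matched {n_matched} of {n_treated} treated units")
```

Reporting `n_matched` alongside the estimate matters: any treated units dropped by the caliper are no longer part of the population the estimate describes.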

Inverse Probability of Treatment Weighting (IPTW)

Reweight each unit by the inverse of its propensity:

\hat{\tau}_{\text{IPTW}} = \frac{1}{N} \sum_i \left[ \frac{T_i Y_i}{e(X_i)} - \frac{(1 - T_i) Y_i}{1 - e(X_i)} \right].

Treated units with small e(X) (treatment is rare for that X) get LARGE weight. Control units with large e(X) (units with that X are usually treated) get LARGE weight. In the reweighted pseudo-population, the treated and control groups each have the covariate distribution of the full sample, as they would under randomisation.
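The weighting formula above translates almost line for line into code. The sketch below compares it with the naive difference in means on simulated data; the data-generating process (one confounder X, selection strength 0.6, true effect 2.0) is an assumption chosen for illustration, and the true propensity is used in place of an estimate.

```python
import math
import random
from statistics import mean

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical data: X raises both the chance of treatment and the outcome.
rng = random.Random(2)
alpha, tau, n = 0.6, 2.0, 20000
rows = []  # tuples of (propensity, treatment, outcome)
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    e = sigmoid(alpha * x)
    t = 1 if rng.random() < e else 0
    y = tau * t + x + rng.gauss(0.0, 1.0)
    rows.append((e, t, y))

# Naive comparison: confounded, because treated units have higher X.
naive = mean(y for _, t, y in rows if t == 1) - mean(y for _, t, y in rows if t == 0)

# IPTW: each treated outcome is weighted by 1/e, each control by 1/(1-e).
iptw = sum(t * y / e - (1 - t) * y / (1.0 - e) for e, t, y in rows) / n

print(f"true effect = {tau}, naive = {naive:.2f}, IPTW = {iptw:.2f}")
```

Under these assumptions the naive difference overstates the effect, while the weighted estimate sits close to the true value.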

The overlap / positivity assumption

For propensity-score methods to work, every X region must have BOTH treated and control units. Formally:

0 < e(\mathbf{x}) < 1 \quad \text{for all } \mathbf{x}.

If e(X) is near 0 or 1 for some units, the IPTW weights explode: a treated unit with e(X) = 0.01 receives weight 1/0.01 = 100, so a handful of units can dominate the estimate and make it unstable. Practical responses:

TRIM: exclude units with e(X) outside [0.05, 0.95]. This discards information and restricts the target population to the region of overlap, but stabilises the estimator.
STABILISED weights: w_i = [p·T_i + (1-p)·(1-T_i)] / [e(X_i)·T_i + (1-e(X_i))·(1-T_i)], where p = P(T = 1) is the marginal probability of treatment. The numerator rescales the weights so that their mean is approximately 1.
Diagnose first: look at the distribution of e(X) by T. If treated and control distributions barely overlap, your sample doesn't have a comparable counterfactual.
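The three responses above can be sketched together. The example assumes a single covariate with strong selection (strength 2.0) and uses the true propensity. The stabilised weights follow the common convention of putting the marginal treatment probability in the numerator. The names `raw_weight` and `stabilised_weight` are hypothetical.

```python
import math
import random
from statistics import mean

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical data with strong selection on X.
rng = random.Random(3)
alpha, n = 2.0, 20000
units = []  # tuples of (propensity, treatment)
for _ in range(n):
    e = sigmoid(alpha * rng.gauss(0.0, 1.0))
    t = 1 if rng.random() < e else 0
    units.append((e, t))

# Diagnose first: range of the propensity score within each group.
for label, flag in (("treated", 1), ("control", 0)):
    scores = [e for e, t in units if t == flag]
    print(f"{label}: e(X) from {min(scores):.4f} to {max(scores):.4f}")

def raw_weight(e, t):
    """1/e for treated units, 1/(1-e) for controls."""
    return t / e + (1 - t) / (1.0 - e)

p_treat = mean(t for _, t in units)  # marginal P(T = 1)

def stabilised_weight(e, t):
    """Raw weight multiplied by the marginal probability of the unit's own arm."""
    return (p_treat * t + (1 - p_treat) * (1 - t)) * raw_weight(e, t)

raw = [raw_weight(e, t) for e, t in units]
stab = [stabilised_weight(e, t) for e, t in units]

# Trim: keep only units whose propensity lies in [0.05, 0.95].
kept = [(e, t) for e, t in units if 0.05 <= e <= 0.95]
trimmed_raw = [raw_weight(e, t) for e, t in kept]

print(f"largest raw weight: {max(raw):.1f}")
print(f"mean stabilised weight: {mean(stab):.3f}")
print(f"kept {len(kept)} of {n} units; largest weight after trimming: {max(trimmed_raw):.1f}")
```

After trimming to [0.05, 0.95] no weight can exceed 1/0.05 = 20, which is the sense in which trimming buys stability at the cost of a narrower target population.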

The crucial caveat: only observed confounders

Propensity-score methods CANNOT recover the true ATE if there's an unmeasured confounder U. The propensity model e(X) is built from observed X only; U doesn't appear; conditioning on e(X) doesn't balance U; residual confounding remains. Sensitivity analysis (§6.8) quantifies how strong an unobserved confounder would have to be to overturn the conclusions.
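A small simulation makes the caveat concrete. The setup is an assumption for illustration: a binary unmeasured confounder V (coded ±1, playing the role of U) shifts both the treatment logit and the outcome. The parameters `gamma` and `delta` are hypothetical. Coding V as binary means P(T = 1 | X) has a closed form, so the analyst's observed-X propensity is exactly right and any remaining error is due to V alone.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

alpha, gamma, delta, tau, n = 0.6, 1.0, 1.0, 2.0, 20000

def e_full(x, v):
    """Propensity given X and the unmeasured V (unavailable in practice)."""
    return sigmoid(alpha * x + gamma * v)

def e_observed(x):
    """P(T = 1 | X): the full propensity averaged over V = +1 / -1."""
    return 0.5 * e_full(x, 1) + 0.5 * e_full(x, -1)

rng = random.Random(4)
sum_obs = 0.0   # IPTW using the observed-X propensity
sum_full = 0.0  # IPTW using the (normally unknowable) full propensity
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    v = 1 if rng.random() < 0.5 else -1
    t = 1 if rng.random() < e_full(x, v) else 0
    y = tau * t + x + delta * v + rng.gauss(0.0, 1.0)
    eo, ef = e_observed(x), e_full(x, v)
    sum_obs += t * y / eo - (1 - t) * y / (1.0 - eo)
    sum_full += t * y / ef - (1 - t) * y / (1.0 - ef)

iptw_obs = sum_obs / n
iptw_full = sum_full / n
print(f"true effect = {tau}")
print(f"IPTW with observed-X propensity: {iptw_obs:.2f}")
print(f"IPTW with full propensity:       {iptw_full:.2f}")
```

Even with a perfectly specified model for P(T = 1 | X), the observed-X estimate stays away from the true effect; only the weights that also use V recover it.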

The honest framing: propensity-score methods are excellent at what they do, which is efficient adjustment for observed confounders, but they are NOT a substitute for randomisation. If you wrote down a DAG (§6.3), the observed X blocks every back-door path in it, AND your data-generating process really matches that DAG, then propensity-score adjustment is valid. The strength of the conclusion is bounded by the strength of those assumptions.

Try it

Start with α = 0.6 (moderate selection), τ = 2.0. Observe the treated (green, upper) and control (red, lower) X distributions diverge — treated has more high-X, control has more low-X.
Naive diff-in-means is biased because the comparison isn't apples-to-apples. IPTW with correct propensity recovers ATE ≈ 2.0.
Crank α to 2.0 (extreme selection). Now treated is almost entirely high-X; control almost entirely low-X. The propensity range shows minimum near 0 or maximum near 1 — POSITIVITY VIOLATED. IPTW weights explode; the estimator becomes unstable.
Set τ = 0. Naive diff is far from zero (pure confounding by X). IPTW recovers ≈ 0.
Hit re-sample under high α. Note: the estimates do not settle down across seeds. The problem is structural: the data contain almost no comparable units in the extreme-X regions, so re-sampling cannot supply the missing counterfactuals.
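For readers without the interactive widget, the experiment can be approximated offline. The sketch below is not the widget's actual code; it assumes a simple data-generating process (one standard-normal X, propensity sigmoid(α·X), outcome τ·T + X + noise) and a hypothetical helper `one_run`, then repeats the IPTW estimate over 30 seeds for moderate and extreme selection.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def one_run(alpha, tau, n, seed):
    """Simulate one sample; return (IPTW estimate, largest weight)."""
    rng = random.Random(seed)
    total = 0.0
    max_w = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = sigmoid(alpha * x)
        t = 1 if rng.random() < e else 0
        y = tau * t + x + rng.gauss(0.0, 1.0)
        w = 1.0 / e if t == 1 else 1.0 / (1.0 - e)
        total += w * y if t == 1 else -w * y
        max_w = max(max_w, w)
    return total / n, max_w

results = {}
for alpha in (0.6, 2.0):
    runs = [one_run(alpha, 2.0, 2000, seed) for seed in range(30)]
    estimates = [est for est, _ in runs]
    results[alpha] = {
        "lo": min(estimates),
        "hi": max(estimates),
        "max_weight": max(w for _, w in runs),
    }
    print(f"alpha = {alpha}: estimates range from {results[alpha]['lo']:.2f} "
          f"to {results[alpha]['hi']:.2f}, "
          f"largest weight {results[alpha]['max_weight']:.0f}")

# With no treatment effect, the estimate should sit near zero.
null_est, _ = one_run(0.6, 0.0, 2000, seed=0)
print(f"alpha = 0.6, tau = 0: IPTW = {null_est:.2f}")
```

Comparing the two printed ranges and the largest weights gives a rough offline counterpart to pressing re-sample at low and high α.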

A study uses propensity-score matching with caliper width 0.1 on the logit scale. Of 1000 treated units, only 700 find a match within the caliper. Is dropping the 300 unmatched units (a) harmless because their treatment effect is undefined, or (b) selection-inducing bias requiring documentation?

What you now know

The propensity score collapses high-dimensional confounder adjustment into a one-dimensional scalar. Matching and IPTW are the two primary adjustment strategies. Positivity (every X has BOTH treated and control) is the make-or-break condition; without it, the methods are unreliable. And crucially: propensity-score methods adjust for OBSERVED confounders only — they require the no-unmeasured-confounders assumption. §6.5 turns to INSTRUMENTAL VARIABLES, the canonical observational tool when unmeasured confounding is suspected.

References

Rosenbaum, P.R., Rubin, D.B. (1983). "The central role of the propensity score in observational studies for causal effects." Biometrika 70(1), 41–55. (The foundational paper.)
Imbens, G.W., Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press. (Chapters 12–14 develop the propensity-score machinery.)
Abadie, A., Imbens, G.W. (2006). "Large sample properties of matching estimators for average treatment effects." Econometrica 74(1), 235–267.
Stuart, E.A. (2010). "Matching methods for causal inference: A review and a look forward." Statistical Science 25(1), 1–21. (Comprehensive applied review.)
Hirano, K., Imbens, G.W., Ridder, G. (2003). "Efficient estimation of average treatment effects using the estimated propensity score." Econometrica 71(4), 1161–1189.