Capstone: Reproducibility audit and multiverse analysis

Part 10, Real-research capstones

Learning objectives

Recognise analytic degrees of freedom as a source of replication failure
Run a multiverse analysis: vary outlier handling, transforms, covariates, missing data, all on a single dataset
Read the specification curve as a graphical summary of robustness
Report the FRACTION of pipelines that reach statistical significance
Recognise when one chosen pipeline is robust vs. when the headline depends on the choices

The final capstone audits a single research project against its OWN analytic choices. Steegen, Tuerlinckx, Gelman & Vanpaemel (2016) called this multiverse analysis: run every defensible analytic pipeline on the same dataset and report the FULL distribution of results, not just the one the analyst happened to choose. It directly addresses the replication crisis by exposing how much "publication-quality findings" depend on small undisclosed analytic choices.

The "garden of forking paths"

Gelman & Loken (2014) popularised this term, borrowed from the title of a Borges short story, for the proliferation of plausible analytic pipelines:

OUTLIER HANDLING: drop |z| > 3? Winsorize at 95th percentile? Keep all?
VARIABLE TRANSFORM: log, square-root, arcsinh, or leave Y on the original scale?
COVARIATE SET: which covariates to adjust for? Pre-registered list, or post-hoc?
MISSING DATA: complete-case, mean imputation, multiple imputation?
SUBGROUP DEFINITION: which thresholds for "old" vs "young", "high" vs "low"?

Each choice is defensible. But without pre-registration, the analyst can, often without intending to, settle on the combination that produces a publishable result. The reported p-value then overstates the strength of evidence, because it ignores the alternative pipelines that could equally well have been run.
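To see how quickly the forking paths multiply, the choices can be enumerated as a grid. The sketch below is illustrative only: it assumes four of the decision points listed above with three options each, which is one way to arrive at the 81 pipelines used in the examples in this section. The option names are hypothetical labels, not part of any library.

```python
from itertools import product

# Hypothetical choice grid: four decision points, three options each.
CHOICES = {
    "outliers": ["keep_all", "drop_abs_z_gt_3", "winsorize_95"],
    "transform": ["none", "log", "sqrt"],
    "covariates": ["none", "preregistered", "all_available"],
    "missing": ["complete_case", "mean_impute", "multiple_impute"],
}

def enumerate_pipelines(choices):
    """Return every combination of options as a list of dicts."""
    names = list(choices)
    return [dict(zip(names, combo)) for combo in product(*choices.values())]

pipelines = enumerate_pipelines(CHOICES)
print(len(pipelines))   # 3 * 3 * 3 * 3 = 81
print(pipelines[0])
```

Adding a fifth decision point with three options would triple the count to 243, which is why the number of defensible pipelines is usually far larger than it first appears.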

The multiverse fix

Run EVERY plausible combination of choices, then plot the distribution of β̂ and p-values across pipelines. Headline metrics:

FRACTION of pipelines reaching p < 0.05.
RANGE of β̂ estimates.
Whether the headline finding is robust (most pipelines agree) or fragile (the conclusion flips across pipelines).

If 80 of 81 pipelines produce p < 0.05 with β̂ in [0.18, 0.22], the headline is ROBUST. If only 30 of 81 produce p < 0.05 and β̂ spans [-0.1, 0.3], the headline depends on choices the analyst made. That is a sensitivity problem, and it should be reported transparently.
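The headline metrics can be computed with a short loop over the grid. What follows is a minimal sketch on simulated data, not the implementation behind this page. The data-generating process, the helper names (simulate, ols_slope, run_pipeline), and the two-sided 5th/95th winsorizing rule are all assumptions. Two simplifications are made for brevity: the transform choice is left out, because transforming Y changes the scale of β̂ and the estimates would need standardising before their range is meaningful, and p-values use a normal approximation. The grid therefore has 18 pipelines, not 81.

```python
import math
from itertools import product
import numpy as np

def simulate(n=300, effect=0.2, seed=0):
    """Hypothetical data: y depends on x and c1; 10% of c1 is then set missing."""
    rng = np.random.default_rng(seed)
    x, c1, c2 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
    y = effect * x + 0.3 * c1 + rng.standard_t(df=4, size=n)  # heavy-tailed noise
    c1[rng.random(n) < 0.1] = np.nan
    return x, y, c1, c2

def ols_slope(y, X):
    """Return (beta, se, p) for the first predictor column of X."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = math.sqrt((sigma2 * np.linalg.inv(X.T @ X))[1, 1])
    p = math.erfc(abs(beta[1] / se) / math.sqrt(2))  # normal approximation
    return beta[1], se, p

def run_pipeline(data, outliers, covariates, missing):
    x, y, c1, c2 = (a.copy() for a in data)
    if missing == "complete_case":      # drop rows with missing c1
        keep = ~np.isnan(c1)
        x, y, c1, c2 = x[keep], y[keep], c1[keep], c2[keep]
    else:                               # mean imputation
        c1[np.isnan(c1)] = np.nanmean(c1)
    if outliers == "drop_abs_z_gt_3":
        keep = np.abs((y - y.mean()) / y.std(ddof=1)) <= 3
        x, y, c1, c2 = x[keep], y[keep], c1[keep], c2[keep]
    elif outliers == "winsorize_5_95":  # assumed two-sided rule
        lo, hi = np.percentile(y, [5, 95])
        y = np.clip(y, lo, hi)
    cols = {"none": [x], "c1": [x, c1], "c1_c2": [x, c1, c2]}[covariates]
    return ols_slope(y, np.column_stack(cols))

GRID = {
    "outliers": ["keep_all", "drop_abs_z_gt_3", "winsorize_5_95"],
    "covariates": ["none", "c1", "c1_c2"],
    "missing": ["complete_case", "mean_impute"],
}

data = simulate()
results = []
for combo in product(*GRID.values()):
    spec = dict(zip(GRID, combo))
    beta, se, p = run_pipeline(data, **spec)
    results.append({**spec, "beta": beta, "se": se, "p": p})

betas = [r["beta"] for r in results]
n_sig = sum(r["p"] < 0.05 for r in results)
print(f"{n_sig} of {len(results)} pipelines reach p < 0.05")
print(f"beta range: [{min(betas):.3f}, {max(betas):.3f}]")
```

Keeping the full specification alongside each estimate, as the results list does here, makes it possible to trace a surprising β̂ back to the exact combination of choices that produced it.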

The specification curve

Simonsohn, Simmons & Nelson (2020) introduced this graphical summary: each row is one analytic pipeline; pipelines are sorted by β̂; coloured markers indicate significance and direction. A clean "specification curve" that mostly lies above (or below) zero across all pipelines suggests robust effects. A curve that crosses zero suggests dependence on analytic choices.
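The sorting and colouring logic behind such a plot is simple enough to sketch in a few lines. The example below is a text-only approximation under stated assumptions: the function names are hypothetical, the 95% interval uses a normal approximation, and the four estimates are made-up numbers for illustration, not results from any study.

```python
def specification_curve(estimates, z_crit=1.96):
    """Sort pipelines by estimate and label each by significance and sign.

    `estimates` is a list of (label, beta, se) tuples.
    """
    rows = []
    for label, beta, se in sorted(estimates, key=lambda e: e[1]):
        lo, hi = beta - z_crit * se, beta + z_crit * se
        if lo > 0:
            status = "sig_positive"
        elif hi < 0:
            status = "sig_negative"
        else:
            status = "not_significant"
        rows.append({"label": label, "beta": beta, "lo": lo, "hi": hi,
                     "status": status})
    return rows

def crosses_zero(rows):
    """True if the point estimates change sign across pipelines."""
    signs = {r["beta"] > 0 for r in rows if r["beta"] != 0}
    return len(signs) > 1

# Hypothetical estimates, for illustration only.
toy = [
    ("keep_all / no covariates", 0.21, 0.06),
    ("drop |z|>3 / no covariates", 0.19, 0.06),
    ("winsorize / full covariates", 0.05, 0.07),
    ("keep_all / full covariates", -0.02, 0.07),
]

MARKERS = {"sig_positive": "+", "sig_negative": "-", "not_significant": "o"}
curve = specification_curve(toy)
for rank, r in enumerate(curve, start=1):
    print(f"{rank:2d} {MARKERS[r['status']]} {r['beta']:+.2f} "
          f"[{r['lo']:+.2f}, {r['hi']:+.2f}]  {r['label']}")
print("crosses zero:", crosses_zero(curve))
```

In this toy case the estimates change sign once the full covariate set is used, so the curve crosses zero and the sketch would flag the finding as dependent on analytic choices.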

Try it

Defaults (true effect = 0.2, n = 300). The multiverse shows 81 pipelines clustered around β̂ ≈ 0.2 with most (~70-80) significant. The specification curve is tight; conclusion is ROBUST.
Set true effect = 0 (null hypothesis). Now examine the fraction with p < 0.05. Each pipeline taken alone has a false-positive rate of about 5%, but the 81 pipelines share one dataset and are strongly correlated, so in any single draw the fraction significant may be 0% or as high as 10-25%. Reporting only a significant pipeline therefore pushes the chance of a false positive well above the nominal 5%. This is the multiple-testing problem hidden inside analytic choices.
Reduce n to 100 (small sample). Every pipeline's confidence interval widens and the estimates spread out along the specification curve; some pipelines are significant, others not. The headline now depends strongly on which pipeline you picked, which is exactly the situation pre-registration is meant to prevent.
Set true effect = -0.3 (moderate negative effect). Most pipelines should converge on a negative β̂. The specification curve crosses zero only for the most outlier-aggressive choices, showing that even legitimate analytic flexibility can't flip a moderately strong signal.
Try different seeds. Notice that the QUALITATIVE picture (robust vs fragile) is reasonably seed-stable, but the EXACT fraction-significant varies. A multiverse is a DISTRIBUTION over analytic choices, and it is itself subject to sampling variability: interpret the shape of the distribution, not exact counts.
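The null experiment (true effect = 0) can be repeated over many simulated datasets to separate two quantities that are easy to confuse: the false-positive rate of each pipeline taken alone, and the chance that at least one pipeline comes out significant. The sketch below is a simplified stand-in for the interactive demo and rests on several assumptions: only three outlier rules serve as pipelines, the noise is heavy-tailed t(4), the function names are hypothetical, and p-values use a normal approximation.

```python
import math
import numpy as np

def slope_p(x, y):
    """Two-sided p-value for the slope of y on x (normal approximation)."""
    xc = x - x.mean()
    beta = xc @ y / (xc @ xc)
    resid = (y - y.mean()) - beta * xc
    se = math.sqrt(resid @ resid / (len(y) - 2) / (xc @ xc))
    return math.erfc(abs(beta / se) / math.sqrt(2))

def pipeline_pvalues(x, y):
    """Three hypothetical outlier rules applied to the same dataset."""
    z = (y - y.mean()) / y.std(ddof=1)
    keep = np.abs(z) <= 3
    lo, hi = np.percentile(y, [5, 95])
    return [
        slope_p(x, y),                   # keep all observations
        slope_p(x[keep], y[keep]),       # drop |z| > 3
        slope_p(x, np.clip(y, lo, hi)),  # winsorize at 5th/95th percentile
    ]

rng = np.random.default_rng(0)
n_datasets, n = 2000, 100
sig = np.zeros((n_datasets, 3), dtype=bool)
for i in range(n_datasets):
    x = rng.normal(size=n)
    y = rng.standard_t(df=4, size=n)     # y is unrelated to x: true effect = 0
    sig[i] = np.array(pipeline_pvalues(x, y)) < 0.05

per_pipeline = sig.mean(axis=0)          # false-positive rate of each rule alone
any_sig = sig.any(axis=1).mean()         # rate if the analyst may pick any rule
print("per-pipeline rates:", np.round(per_pipeline, 3))
print("at least one significant:", round(float(any_sig), 3))
```

The at-least-one rate can never fall below the largest per-pipeline rate, and the gap between the two grows as more, and less correlated, pipelines are added to the grid.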

A published study reports β̂ = 0.4 with p = 0.03 under one specific pipeline (drop |z| > 3, log-transform, full covariate set, complete-case). You re-run the multiverse and find only 12 of 81 pipelines give p < 0.05, and β̂ ranges from -0.1 to 0.5. What is the appropriate interpretation, and how would you communicate this in your replication report?

What you now know

You can run a multiverse analysis to expose analytic-choice dependence. You read the specification curve as a fragility diagnostic. You understand that choosing among 81 pipelines after seeing the results, without pre-registration, inflates Type-I error well above the nominal 5%. The fix is methodological (pre-registration, multiverse-by-default, transparent reporting), not statistical: a test run on one chosen pipeline cannot account for the implicit multiple testing across analytic choices.

References

Steegen, S., Tuerlinckx, F., Gelman, A., Vanpaemel, W. (2016). "Increasing transparency through a multiverse analysis." Perspectives on Psychological Science 11(5), 702-712.
Gelman, A., Loken, E. (2014). "The statistical crisis in science." American Scientist 102, 460-465.
Simonsohn, U., Simmons, J.P., Nelson, L.D. (2020). "Specification curve analysis." Nature Human Behaviour 4, 1208-1214.
Nosek, B.A. et al. (2018). "The preregistration revolution." PNAS 115(11), 2600-2606.
Open Science Collaboration (2015). "Estimating the reproducibility of psychological science." Science 349(6251), aac4716.