Preregistration and the garden of forking paths

Part 2 — Hypothesis testing without p-hacking

Learning objectives

  • Define the GARDEN OF FORKING PATHS (Gelman & Loken 2014): a real analysis contains many implicit choices — variables to include, transformations, outlier rules, subgroups, modelling decisions — and each choice is a fork; if the researcher follows the path that gives the most publishable result, the effective Type-I rate inflates EVEN WITHOUT conscious p-hacking
  • Distinguish the FORKING PATHS from §2.4's P-HACKING: p-hacking is a small menu of concrete cheating strategies (optional stopping, multi-outcome fishing, post-hoc outlier dropping); the forking paths is the much larger menagerie of innocent-looking choices, any single one of which is defensible but whose collective freedom inflates the false-positive rate
  • Quantify the inflation: with 192 defensible paths (16 covariate subsets from 5 candidates × log/raw × 3 outlier rules × interaction yes/no on a moderately-correlated design) under H₀, the empirical Type-I rate under best-path selection lands roughly 2.5–3.5× the nominal α (around 12–18% when α = 0.05), vs ~ 5% under pre-specification; the inflation factor depends on how correlated the paths are: fully independent paths would push the rate toward 1 − (1 − α)^192 ≈ 99.99%, but realistic path correlations cap it well below that worst case
  • State the PREREGISTRATION mechanism: write down the analysis plan BEFORE seeing the data — hypotheses, sample size, statistical tests, decision rules, what counts as confirmatory vs exploratory — time-stamp it, and make it public
  • List the five classes of platform: OSF (Open Science Framework, comprehensive, journal-agnostic), AsPredicted (lightweight, 9-question template), ClinicalTrials.gov (legally mandated for FDA-regulated US trials post-2007), EU CTR (parallel for EU drug regulation), PROSPERO (prospective register for systematic-review protocols)
  • Define REGISTERED REPORTS (Chambers 2013): a publishing model where the journal accepts the paper based on the QUESTION + METHODS, BEFORE seeing the results; eliminates publication bias against null findings; ~ 200 journals offered this format by 2024 (COS 2024 inventory)
  • Identify what preregistration is NOT: not a guarantee of correctness (bad questions, bad measurement, low n still all matter); not a prohibition on exploratory analysis (which remains a valid hypothesis-generating mode, just clearly labelled); not the only solution to the replication crisis
  • Walk the 9-STEP TEMPLATE: research question, hypotheses + directionality, sample size + power calc, inclusion/exclusion, outlier handling, primary analysis, multiple-testing correction, decision rule, confirmatory vs exploratory declaration
  • Map each step to a SPECIFIC FAILURE MODE it forecloses: step 2 → post-hoc directional switch, step 3 → optional stopping, step 4 → subgroup carving, step 5 → outlier p-hacking, step 6 → forking paths, step 7 → multi-outcome fishing, step 8 → significance-only inference, step 9 → confirmatory-claim sneaking
  • Recognise the REPORTING GUIDELINES that complement preregistration: CONSORT (Moher et al. 2010) for parallel-group RCTs, STROBE (von Elm et al. 2007) for observational studies, PRISMA (Liberati et al. 2009) for systematic reviews
  • Internalise the EVIDENCE base: Nosek et al. (2018) review the impact of preregistration on replication rates; John, Loewenstein & Prelec (2012) measured the prevalence of questionable research practices (1–66% across 10 categories in a 2155-psychologist sample); Munafò et al. (2017) consolidate the procedural recommendations in the Nature Human Behaviour manifesto for reproducible science
  • Preview §2.7 (equivalence testing — the TOST machinery for supporting 'no meaningful effect') and §2.8 (the replication crisis — what happens to a literature when these procedural disciplines fail systematically)

§2.4 dissected what a p-value is, and what it is not, and previewed four p-hacking mechanisms (optional stopping, multi-outcome fishing, post-hoc outlier dropping, and the GARDEN OF FORKING PATHS) that inflate the operational Type-I rate above the nominal α. §2.5 formalised the multiple-testing correction problem and laid out the FWER and FDR machinery (Bonferroni, Holm, Benjamini-Hochberg). All of that machinery has a load-bearing assumption: the FAMILY of tests is FIXED in advance. If the researcher adds tests post-hoc, or drops tests that gave large p-values, or quietly switches the transform / outlier rule / covariate set until the answer looks clean, the error-rate guarantee dissolves. §2.6 is the procedural section: how do you commit, BEFORE the data arrive, to a single confirmatory analysis path so that §2.5's guarantees actually apply?

The §2.6 arc has five stops. First, the GARDEN OF FORKING PATHS — Gelman & Loken's (2014) metaphor for the realistic menagerie of innocent-looking analytic choices that surround any real dataset. Second, PREREGISTRATION as the procedural fix: write down the plan BEFORE the data, time-stamp it, make it public. Third, the 9-STEP TEMPLATE — research question, hypotheses, sample size, inclusion/exclusion, outliers, primary analysis, multiple-testing correction, decision rule, confirmatory-vs-exploratory declaration — with each step mapped to one specific failure mode it forecloses. Fourth, REGISTERED REPORTS (Chambers 2013) as the publishing-model variant that addresses publication bias too. Fifth, HONEST CAVEATS — what preregistration does not fix (low n, bad measurement, bad questions, exploratory work) — and the REPORTING GUIDELINES (CONSORT, STROBE, PRISMA) that complement it for specific study types.

One framing note before the metaphor. Preregistration is not a confession that researchers are dishonest. The standard case for it does NOT require the assumption that researchers are p-hackers. Gelman & Loken's (2014) entire point is that the inflation arises EVEN under perfect-good-faith analysis when the data-dependent choice menu is large. Preregistration is to the analyst what a sealed envelope is to a magician: not a moral safeguard but a procedural one. With the envelope sealed, the magician's sincerity is irrelevant — the audience can trust the trick by checking the envelope. Without the envelope, even an honest magician cannot be distinguished from a clever one. §2.6 is about sealing the envelope.

The garden of forking paths

Gelman & Loken (2014, American Scientist) coined the metaphor borrowing from Jorge Luis Borges' 1941 short story "The Garden of Forking Paths" — a labyrinth where every choice leads to a different future. In the statistical version, the labyrinth is the menu of analytic choices a thoughtful researcher faces when confronted with a real dataset:

  • Which variables. Five candidate covariates → 2⁵ = 32 inclusion subsets. With the focal predictor always included, 2⁴ = 16 subsets remain.
  • Which transformations. Raw outcome vs log vs square-root vs rank — usually 2–4 options that a domain reviewer would all accept.
  • Which outlier rules. No exclusion vs > 2σ drop vs > 3σ drop vs studentised-residual > 2.5 vs Tukey-fence vs domain-specific (e.g., reaction time < 100 ms in cognitive psych). 3–6 options that each have published precedent.
  • Which functional form. Linear vs quadratic vs interaction-with-X2 vs B-spline. Adding an interaction is "more complete"; not adding it is "more parsimonious"; both are defensible.
  • Which subgroup. Pre-specified by design, or post-hoc identified ("the effect is in the females") — the latter is the §2.4 mechanism, but the former is also a fork if multiple pre-specified subgroups are in play and only the significant one is reported.
  • Which test. Welch t vs Student t vs Wilcoxon vs permutation vs Bayesian-BF. Each has different assumptions and different p-values.
  • Which standard errors. OLS vs robust (HC0/HC1/HC2/HC3) vs cluster-robust vs bootstrap. Pick the smallest SE → significant.

The product of even a modest fork count is large. Two covariate subsets × two transforms × three outlier rules × two interaction options is already 24 paths. Five covariate subsets × two transforms × three outlier rules × two interaction options is 60 paths. A six-fork analysis routinely surfaces 100+ paths. The key Gelman-Loken (2014) observation: a researcher confronted with such a menu need not be CONSCIOUSLY p-hacking. The researcher may sincerely walk down the most reasonable-looking path FIRST given the data — and the data, being random, suggest "the most reasonable-looking path" in different directions in different replications. This is the implicit selection that inflates Type-I.
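The fork arithmetic above can be checked mechanically. The following is a minimal sketch, assuming each fork is represented as a list of options; the option labels (such as `subset_0`) are placeholders invented for illustration, not names from the widget.

```python
from itertools import product
from math import prod

def count_paths(forks):
    """Number of distinct analysis paths: the product of the option counts per fork."""
    return prod(len(options) for options in forks.values())

# Hypothetical labels; only the number of options per fork matters here.
small = {
    "covariates": ["X1", "X1+X2"],
    "transform": ["raw", "log"],
    "outliers": ["none", "2sd", "3sd"],
    "interaction": [False, True],
}
five_subsets = dict(small, covariates=[f"subset_{i}" for i in range(5)])
widget_like = dict(small, covariates=[f"subset_{i}" for i in range(16)])

print(count_paths(small))         # 2 * 2 * 3 * 2 = 24
print(count_paths(five_subsets))  # 5 * 2 * 3 * 2 = 60
print(count_paths(widget_like))   # 16 * 2 * 3 * 2 = 192

# Enumerating the paths explicitly gives the same count.
paths = list(product(*widget_like.values()))
```

Each added fork multiplies the count, which is why a handful of individually modest choices reaches three digits so quickly.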

The first widget operationalises this on a specific design. 5 candidate covariates, one of which (X1) is the focal predictor; under H₀, NONE of the covariates affects the outcome Y; n = 60. The reader can manually pick one path; the widget reports the p-value for the X1 coefficient on the current dataset. Then the widget runs all 192 defensible paths on the SAME dataset and reports the minimum p across paths. Then it repeats over many replicate datasets to estimate the empirical Type-I rate under best-path selection.
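A rough offline version of this simulation might look like the sketch below. It is not the widget's code. The data-generating process (standard-normal covariates and outcome, correlation 0.3 between X1 and X2), the signed-log transform, the outlier rules applied to the transformed outcome, and the decision to add the X1·X2 product even when X2 has no main effect in the model are all assumptions made for illustration, so the printed rates need not match the widget's.

```python
from itertools import combinations, product

import numpy as np
from scipy import stats

N_OBS, ALPHA, RHO_12 = 60, 0.05, 0.3   # assumed design constants

def make_paths():
    """16 covariate subsets (X1 always in) x 2 transforms x 3 outlier rules x 2 interaction options."""
    extras = [c for r in range(5) for c in combinations(range(1, 5), r)]
    return list(product(extras, ("raw", "slog"), (None, 2.0, 3.0), (False, True)))

def simulate_null(rng):
    """One H0 dataset: Y is independent of all five covariates; corr(X1, X2) = RHO_12."""
    X = rng.standard_normal((N_OBS, 5))
    X[:, 1] = RHO_12 * X[:, 0] + np.sqrt(1 - RHO_12**2) * X[:, 1]
    return X, rng.standard_normal(N_OBS)

def t_stat_x1(X, y, path):
    """OLS t statistic and residual degrees of freedom for the X1 coefficient on one path."""
    extras, transform, cutoff, interaction = path
    if transform == "slog":
        y = np.sign(y) * np.log1p(np.abs(y))        # signed log keeps negative outcomes
    keep = np.ones(len(y), dtype=bool)
    if cutoff is not None:                          # drop |y - mean| > cutoff * sd
        keep = np.abs(y - y.mean()) <= cutoff * y.std(ddof=1)
    cols = [np.ones(len(y)), X[:, 0]] + [X[:, j] for j in extras]
    if interaction:
        cols.append(X[:, 0] * X[:, 1])              # assumption: product term added as-is
    D, yk = np.column_stack(cols)[keep], y[keep]
    beta, *_ = np.linalg.lstsq(D, yk, rcond=None)
    resid = yk - D @ beta
    df = len(yk) - D.shape[1]
    se = np.sqrt(resid @ resid / df * np.linalg.inv(D.T @ D)[1, 1])
    return beta[1] / se, df

def path_p_values(X, y, paths):
    """Two-sided p-values for X1 across all paths on the same dataset."""
    t, df = np.array([t_stat_x1(X, y, path) for path in paths]).T
    return 2 * stats.t.sf(np.abs(t), df)

def type1_rates(n_datasets=200, seed=0):
    """Empirical Type-I rate for one fixed path vs. best-of-all-paths selection."""
    rng = np.random.default_rng(seed)
    paths = make_paths()
    fixed_hits = best_hits = 0
    for _ in range(n_datasets):
        X, y = simulate_null(rng)
        p = path_p_values(X, y, paths)
        fixed_hits += int(p[0] < ALPHA)      # path 0: X1 only, raw Y, no drop, no interaction
        best_hits += int(p.min() < ALPHA)    # report whichever path looks best
    return fixed_hits / n_datasets, best_hits / n_datasets

fixed_rate, best_rate = type1_rates()
print(f"pre-specified path: {fixed_rate:.3f}   best of {len(make_paths())}: {best_rate:.3f}")
```

The fixed path is chosen before any data are generated, which is the code-level analogue of preregistration; the best-path rate can never be lower than the fixed-path rate on the same datasets, because the minimum over paths includes the fixed one.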

[Interactive figure: Forking Paths Simulator]

Things to verify in the widget:

  • Run with the default seed. The "best-of-192" empirical Type-I rate should land in the 12–18% range across the 500 replicate datasets, vs the "pre-specified single path" rate of around 5%. The first rate is the cost of having 192 defensible analyses on the table without pre-registering which one is the confirmatory test. The second rate is what α = 0.05 actually delivers when the path is fixed. The inflation factor is roughly 2.5–3.5× the nominal α: modest by the standards of a fully-independent multiple-testing problem, but enough to roughly triple the false-positive rate of any analysis run this way.
  • Click around the manual path controls and watch the p for the X1 coefficient. With the same dataset, you should be able to find a path with p < 0.05 (and often several). All of these are FALSE POSITIVES because the dataset was generated under H₀. Notice how the inflation works: each individual path is locally defensible, but the freedom to pick among them is the actual mechanism.
  • Re-roll the dataset a few times. On any single dataset, the histogram shows the distribution of p-values across paths — typically a few percent of the 192 paths will report p < 0.05 by chance even under H₀. The headline "best-path Type-I rate" is the rate at which AT LEAST ONE of the 192 paths cracks 0.05, taken over replicate datasets. With heavy path correlation (many paths share most of their structure on a given dataset) this rate is well below the fully-independent worst case but still well above the nominal α.
  • Slide the replicate datasets count up to 5000. The Monte Carlo noise on the best-path Type-I estimate shrinks as 1/√N (one binomial standard error falls from roughly ±1.5–2 pp at 500 datasets to about ±0.5 pp at 5000); the headline rate stays stable. The exact rate is a function of THIS design (5 covariates with ρ12 = 0.3, signed-log option, 3 outlier rules, interaction option). Different designs give different rates; the general phenomenon is the same.
  • The path correlation structure caps the inflation FAR below the fully-independent worst case 1 − 0.95^192 ≈ 99.99%. The empirically observed rate (around 14% on this design) reflects that many paths reuse the same X1 noise structure on the same dataset, so their p-values are highly correlated. Even with that heavy correlation, the rate is roughly 3× the nominal 5%. The cost of analytic freedom is not the 99.99% upper bound; it is the moderately-correlated middle. In real research with less-correlated paths (e.g., several genuinely different outcome variables instead of one outcome with multiple transforms), the rate can climb much higher.
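Two of the numbers in this list follow from short formulas: the fully-independent worst case is 1 − (1 − α)^192, and the Monte Carlo noise on an estimated rate is approximately the binomial standard error √(p(1 − p)/N). A minimal sketch, assuming a true best-path rate of 14% (the central value quoted above) when computing the noise band:

```python
from math import sqrt

ALPHA, N_PATHS = 0.05, 192

# Worst case: 192 fully independent tests, each with false-positive rate alpha.
independent_bound = 1 - (1 - ALPHA) ** N_PATHS
print(f"independent worst case: {independent_bound:.2%}")

def mc_standard_error(rate, n_datasets):
    """Binomial standard error of a rate estimated from n_datasets replicates."""
    return sqrt(rate * (1 - rate) / n_datasets)

for n in (500, 5000):
    se_pp = 100 * mc_standard_error(0.14, n)   # 0.14 is the assumed true rate
    print(f"N = {n}: about +/- {se_pp:.1f} pp (one standard error)")
```

One standard error at a 14% rate is about 1.6 pp with 500 datasets and about 0.5 pp with 5000, in line with the noise bands quoted for the widget; a 95% interval would be roughly twice as wide.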

Preregistration: sealing the envelope

Preregistration is the procedural fix. The mechanism, in three sentences. (1) Before collecting data — or before LOOKING at the data, if it is archival — write down the analysis plan in enough detail that someone else could execute it. (2) Submit the plan to a third-party time-stamped registry (OSF, AsPredicted, ClinicalTrials.gov, EU CTR). (3) After data collection, run the pre-specified analysis exactly as written; report it as the confirmatory result; any additional analyses are clearly labelled exploratory.

What "in enough detail" means is the heart of the discipline. The plan must collapse all the forks above to ONE path per primary hypothesis. Specifically:

  • Sample size, with justification. "n = 80 per arm" alone is a number; "n = 80 per arm, power = 0.80 at d = 0.45, α = 0.05 two-sided, G*Power 3.1" is a JUSTIFIED commitment. The justification matters because it forecloses optional stopping — if you commit to 80, you stop at 80 regardless of interim p-values.
  • Inclusion / exclusion criteria. Specifically: who is eligible (the study population) and who is dropped AFTER enrolment (e.g., for protocol violation, missing data, baseline measurements). Pre-specifying both prevents post-hoc subgroup carving.
  • Outlier rule. "Drop observations with |y − ȳ| > 3·s, computed on the baseline assessment, within group" is a rule. "Drop outliers" is not — it leaves the threshold and the reference distribution open. Without a rule, the §2.4 outlier-drop p-hacking mechanism is wide open.
  • Primary analysis. The ONE test — statistic, alternative, α — that decides H₁. Naming this collapses 192 forks to one. Secondary analyses are listed separately with their own corrections.
  • Multiple-testing correction. If there are k > 1 primary tests, which §2.5 method applies and at what level. "BH FDR at q = 0.05" is a commitment; "we will correct for multiple testing" is not.
  • Decision rule. What conclusion does each (p, effect, CI) combination support? Pre-specifying the rule — "reject H₀ if p < 0.05 AND |d| ≥ 0.20" — prevents the "we found p = 0.04 with d = 0.05" claim that is statistically significant but practically meaningless.
  • Confirmatory vs exploratory. A single sentence: which analyses are pre-specified (confirmatory)? Which are exploratory? The latter are still valuable for hypothesis generation, but they cannot be claimed as confirmatory in the abstract.
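The sample-size justification in the first bullet can be sanity-checked with the standard normal-approximation formula for a two-sided, two-sample comparison, n per arm ≈ 2(z₁₋α/₂ + z_power)² / d². The sketch below is an approximation, not a reproduction of the G*Power calculation; the exact t-based answer is slightly larger, and the function names are invented for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal

def n_per_arm(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided two-sample test."""
    return ceil(2 * (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) ** 2 / d**2)

def approx_power(n, d, alpha=0.05):
    """Approximate power of the same test with n participants per arm."""
    return Z.cdf(d * sqrt(n / 2) - Z.inv_cdf(1 - alpha / 2))

print(n_per_arm(0.45))                    # 78 under the normal approximation
print(round(approx_power(80, 0.45), 2))   # about 0.81
```

The approximation gives 78 per arm, so a committed n = 80 per arm sits just above it with power slightly over 0.80. Writing the calculation down next to the number is what turns "n = 80" into a justified commitment.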

The discipline is structured, not arbitrary. The second widget walks the 9-step template and produces a YAML preregistration document the reader can copy to OSF (osf.io/prereg) or AsPredicted (aspredicted.org). Each step has a "why?" toggle naming the specific failure mode the step forecloses.

[Interactive figure: Preregistration Builder]

Things to verify in the widget:

  • Click "Load example". The widget fills all nine fields with a worked CBT-anxiety RCT preregistration. Read the YAML preview at the bottom — it is a complete, registrable analysis plan in roughly 200 lines. The number of decisions made up front is striking; that is the point.
  • Click "why?" on step 3 (sample size). The tooltip names the failure mode foreclosed: optional stopping (Armitage, McPherson & Rowe 1969). Pre-stating n locks in the stopping rule. Compare with step 6 (primary analysis): the tooltip names the garden of forking paths (Gelman & Loken 2014). Pre-naming the primary collapses the path menagerie.
  • Clear the form and try drafting a preregistration for a study you have in mind, real or hypothetical. The "9 of 9 remaining" counter ticks down as each field reaches 10+ characters. The exercise is reflective: the FIRST step (research question) is almost always the hardest, because real research questions are usually less crisp than the literature presents them.
  • Use the Copy YAML button. The resulting block is structurally what an OSF or AsPredicted submission expects. (Both platforms also accept free-form text, but the YAML structure forces you to address each step.)
  • The 9 steps are not exhaustive — a clinical trial preregistration on ClinicalTrials.gov also requires the intervention description, the recruitment plan, the data-monitoring committee composition, etc. But the 9 steps are the LOAD-BEARING ones for the statistical-inference part of the discipline: they collapse the forking-paths menagerie and lock the §2.5 family.
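The builder's bookkeeping can be imitated in a few lines. This is a sketch only: the field names are hypothetical (the widget's actual YAML keys are not given in the text), the 10-character completeness rule is the one described above, and the rendering relies on the fact that JSON-quoted strings are also valid YAML scalars rather than on a YAML library.

```python
import json

# Hypothetical key names for the nine steps of the template.
PREREG_STEPS = (
    "research_question", "hypotheses", "sample_size", "inclusion_exclusion",
    "outlier_handling", "primary_analysis", "multiple_testing",
    "decision_rule", "confirmatory_vs_exploratory",
)
MIN_CHARS = 10  # a field counts as filled at 10+ characters

def remaining_steps(plan):
    """Steps whose text is still shorter than MIN_CHARS characters."""
    return [s for s in PREREG_STEPS if len(plan.get(s, "").strip()) < MIN_CHARS]

def to_yaml(plan):
    """Render the plan as flat 'key: "value"' lines, one per step."""
    return "\n".join(f"{s}: {json.dumps(plan.get(s, ''))}" for s in PREREG_STEPS)

# A partially drafted plan, using the example commitments from this section.
draft = {
    "sample_size": "n = 80 per arm, power = 0.80 at d = 0.45, alpha = 0.05 two-sided",
    "outlier_handling": "Drop observations with |y - ybar| > 3*s, baseline, within group",
    "multiple_testing": "BH FDR at q = 0.05",
    "decision_rule": "Reject H0 if p < 0.05 AND |d| >= 0.20",
}

print(f"{len(remaining_steps(draft))} of {len(PREREG_STEPS)} remaining")
print(to_yaml(draft))
```

A character count is a weak proxy for a real commitment ("Drop outliers" has more than 10 characters and commits to nothing), so a check like this only catches empty fields, not vague ones.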

Registries and registered reports

Preregistration only works if the plan is TIME-STAMPED and PUBLIC. Time-stamping prevents the "we always planned this" rewrite; public visibility makes the plan checkable by reviewers, readers, and future replicators. Five classes of platform cover most working researchers:

  • OSF (Open Science Framework, osf.io). Comprehensive, journal-agnostic, free. Hosts preregistrations alongside the rest of the project (data, code, manuscripts). Multiple preregistration templates including AsPredicted, PRP (Preregistration in Social Psychology), and a free-form template. Run by the Center for Open Science (Nosek et al. 2015, Science).
  • AsPredicted (aspredicted.org). Lightweight, 9-question template designed for psychology. Faster than OSF but less flexible. Run by the University of Pennsylvania.
  • ClinicalTrials.gov. Legally REQUIRED for FDA-regulated US clinical trials by the FDA Amendments Act 2007 §801. Outcome registry mandates that the primary outcome and secondary outcomes be declared before enrolment closes. Covers ~ 400,000 trials by 2024.
  • EU CTR (EudraCT / EU Clinical Trials Register). EU equivalent of ClinicalTrials.gov, mandated by EU Regulation 536/2014. Required for any EU-conducted trial of a regulated medicinal product.
  • PROSPERO. The international prospective register of systematic reviews. Mandated by many journals for systematic-review protocols.

Preregistration addresses the FORKING PATHS problem (within a single study) but does NOT address PUBLICATION BIAS (across studies). A null preregistered finding can still be filed in the drawer. The fix for that is REGISTERED REPORTS — a publishing model proposed by Chambers (2013, Cortex) where the journal makes the publication decision in two stages:

  • Stage 1: review of the protocol. The journal evaluates the question, the design, the analysis plan — BEFORE data collection. If accepted, the journal commits in principle to publishing the eventual result, regardless of outcome.
  • Stage 2: review of the report. After data collection, the journal checks that the pre-specified analysis was actually run, that the results are reported faithfully, and that any exploratory analyses are clearly labelled. The publication decision is based on whether the procedure was followed — not whether the results were significant.

By 2024 roughly 200 journals offered the Registered Report format (Center for Open Science 2024 inventory), including Nature Human Behaviour, Royal Society Open Science, BMC Medicine, Psychological Science, Cortex, and European Journal of Personality. Empirical evaluations of the format show ~ 60% of Registered Reports report null or partial-support findings (Allen & Mehler 2019, PLoS Biol), compared to ~ 5–15% of conventional reports in the same fields — a direct measurement of the publication bias the format corrects.

Reporting guidelines: CONSORT, STROBE, PRISMA

Preregistration locks the analysis BEFORE data collection. Reporting guidelines describe what the WRITTEN-UP study should contain AFTER data collection, so a reader can evaluate whether the published analysis matches the preregistered plan. Three are essentially universal across the life sciences:

  • CONSORT 2010 (Moher et al. 2010, BMJ). CONsolidated Standards Of Reporting Trials. The reporting checklist for parallel-group randomised controlled trials. 25 items covering trial design, randomisation, blinding, outcomes, statistical methods, and a flowchart of participant flow from enrolment through analysis. Endorsed by the ICMJE; required by most medical journals.
  • STROBE (von Elm et al. 2007, Epidemiology). STrengthening the Reporting of OBservational studies in Epidemiology. The reporting checklist for cohort, case-control, and cross-sectional observational studies. 22 items covering sampling, exposure assessment, confounding control, and limitations. Variants exist for specific subtypes (STROBE-MR for Mendelian randomisation, STROBE-vet for veterinary).
  • PRISMA 2020 (Page et al. 2021, BMJ; original Liberati et al. 2009, PLoS Med). Preferred Reporting Items for Systematic reviews and Meta-Analyses. 27 items + a flow diagram tracking records from identification through inclusion, for systematic reviews and meta-analyses. The 2020 update modernised the search-strategy documentation and the bias-assessment requirements.

The guidelines are not preregistration — they describe how to REPORT a study, regardless of whether it was preregistered. But they complement preregistration: a fully-CONSORT-compliant report explicitly cross-references the preregistered protocol and flags any deviations. The EQUATOR Network (Enhancing the QUAlity and Transparency Of health Research, equator-network.org) maintains an up-to-date inventory of all major reporting guidelines across study types.

What preregistration does NOT fix

Preregistration is a partial fix, not a complete one. The first three items below are problems it does not address; the last two are modes of research that sit alongside it rather than inside it:

  • Bad research questions. Preregistration locks the analysis path; it does not improve the question. A preregistered study of a misformulated, untestable, or trivial question is still misformulated, untestable, or trivial. The question itself comes from the science, not the procedure.
  • Low n / underpowered designs. §2.2 made the case that an underpowered study cannot deliver a confirmatory result regardless of the analysis path. Preregistration is the wrong place to fix this; the fix is upstream in the design phase. A preregistered underpowered study is at best explicit about its limitations and at worst gives a sheen of legitimacy to a futile experiment.
  • Questionable measurement. The flexibility of construct operationalisation is its own fork. If a researcher pre-registers that "anxiety will be measured by the SAS", but the SAS has 24 items, multiple sub-scales, and several composite scoring algorithms, and the analysis plan does not specify which composite is used, then significant analytic freedom remains. Flake & Fried (2020) document this as the "measurement schmeasurement" problem; the fix is detailed pre-specification of the exact scoring rule, not just the instrument name.
  • Exploratory work. Some research is genuinely hypothesis-generating: high-throughput screens, exploratory data analysis, machine-learning model search. These cannot be preregistered in the §2.6 sense because the hypotheses are not yet formed. The procedural fix here is HONEST LABELLING — call exploratory work exploratory, report effect sizes and CIs, do not claim confirmatory inference. Wagenmakers et al. (2012, Perspect Psychol Sci) argue for the equally-rigorous "purely confirmatory" / "purely exploratory" split, with the former preregistered and the latter explicitly flagged.
  • Adversarial collaboration. When the field is genuinely uncertain about an effect, two opposing teams sometimes co-design a study with a preregistered plan agreed BY BOTH SIDES (Mellers, Hertwig & Kahneman 2001, Psychol Sci). This is rare but powerful — and it is downstream of, not replaced by, preregistration.

The honest position. Preregistration converts the operational Type-I rate from a free parameter (the researcher's effective freedom across the path menagerie) back to a controlled quantity (the declared α of the primary test). That is a substantial gain. It does NOT make underpowered, badly-measured, or trivially-framed studies good. The §2.8 replication crisis will return to this point: the literature's reproducibility depends on multiple disciplines simultaneously, of which preregistration is one.

The working researcher's preregistration workflow

Concretely, the steps for a new study are:

  • Frame the research question precisely. What is the one question this study answers? Pin it down to a sentence. If it takes a paragraph, the question is still vague.
  • Specify the design. Sample size with power justification, inclusion / exclusion criteria, measurement instruments with exact scoring rules.
  • Specify the analysis. Primary statistical test, all variables included, transformations applied, outlier rules. The §2.5 multiple-testing correction if there is more than one primary.
  • Specify the decision rule. What conclusion does each (p, effect, CI) combination support? Pair the p-value criterion with an effect-size threshold so significance and importance are decoupled.
  • Pre-register at OSF / AsPredicted / ClinicalTrials.gov. Time-stamp the plan. Make it public.
  • Collect the data. Per the inclusion/exclusion criteria, to the pre-stated n.
  • Run the pre-specified analysis. Exactly as written. Report it clearly as the confirmatory result.
  • Run any additional analyses. Label them EXPLORATORY. Report effect sizes and CIs; do not claim confirmatory inference.
  • Report ALL of it. Include all collected data, all considered variables, all attempted analyses, all deviations from the preregistered plan with explanations. The transparency is the trick.
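Running the pre-specified analysis "exactly as written" is easier when the rules are frozen as code before the data arrive. The sketch below freezes two of them: a 3·s outlier rule and the paired decision rule (p < 0.05 AND |d| ≥ 0.20) used as an example earlier in this section. The function names are invented, and the outlier rule is simplified to a single list of values rather than the baseline, within-group version a real plan would specify.

```python
from statistics import mean, stdev

def drop_outliers(values, k=3.0):
    """Keep observations within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]

def decide(p, d, alpha=0.05, min_effect=0.20):
    """Pre-specified rule: reject H0 only if p < alpha AND |d| >= min_effect."""
    if p < alpha and abs(d) >= min_effect:
        return "reject H0 (confirmatory)"
    if p < alpha:
        return "significant but below the effect-size threshold"
    return "do not reject H0"

print(decide(p=0.04, d=0.05))   # significant but below the effect-size threshold
print(decide(p=0.04, d=0.32))   # reject H0 (confirmatory)
print(decide(p=0.18, d=0.32))   # do not reject H0
```

Committing functions like these to a time-stamped repository alongside the registered plan leaves no room to adjust the threshold or the rule after seeing the results.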

Try it

  • In the forking-paths-simulator, run the default seed at 500 replicate datasets. Note the headline "best-path Type-I rate" — typically 12–18%, an inflation of ~ 3× the nominal 5%. Then re-roll the seed three times and re-check. The rate should orbit within a few percentage points of the central value; the Monte Carlo noise band at 500 datasets is about ±2 pp at this rate.
  • Same widget. On the manual path side, start with everything OFF (X1 only, raw Y, no drop, no interaction). Note the p for X1 on the current dataset. Now systematically click each option in turn (add X2, add log, etc.), watching how the p drifts. After 5–10 clicks you should have seen a path with p < 0.05, even though the truth is H₀. The phenomenon is exactly the garden of forking paths — innocent choices, accidental significance.
  • Same widget. Slide the replicate datasets count from 200 up to 5000. The headline rate becomes more stable; the histogram fills in more smoothly. Verify that the "pre-specified single path" rate stays at ~ 5% throughout — the histogram of min-p has high mass in the first bin under best-path selection, but the rate of "p < 0.05" for ONE FIXED path lands at the nominal α.
  • In the preregistration-builder, click "Load example". Read through the 9 filled-in fields. For each, click "why?" and read the failure mode foreclosed. For steps 2–9 the mapping is one-to-one: step 2 → post-hoc directional switch, step 3 → optional stopping, step 4 → subgroup carving, step 5 → outlier p-hacking, step 6 → forking paths, step 7 → multi-outcome fishing, step 8 → significance-only inference, step 9 → confirmatory-claim sneaking.
  • Same widget. Clear the form. Draft a 9-step preregistration for a study you would actually run; if you don't have one, pick a published paper from your field and reconstruct what its preregistration would have looked like. You will likely find that some steps are quick to fill in and some are surprisingly hard (research question, decision rule, confirmatory-vs-exploratory split). The hardness is the point.
  • Pen-and-paper. A psychology study reports a significant effect (p = 0.04, d = 0.32) with the following analysis: 4 candidate covariates (2 included), Y log-transformed, > 2σ outliers dropped, interaction included. Count the defensible alternative analyses on this data: 2⁴ covariate subsets (16) × 2 transforms × 3 outlier rules (none/2σ/3σ) × 2 interaction options = 192. Argue why the reported p = 0.04 is, in expectation, a much weaker signal than the same p from a preregistered single-path analysis would be.
  • Pen-and-paper. Distinguish §2.4-style P-HACKING (deliberate strategy choice — optional stopping, multi-outcome fishing) from §2.6-style FORKING PATHS (innocent choice among defensible analyses). Argue why the latter is harder to detect (the researcher need not be conscious of it) and why preregistration is essential even when no malicious intent is at play. Cite Gelman & Loken (2014) for the canonical statement.
  • Pen-and-paper. Compare the Nosek et al. (2018) replication-rate gap (~ 50% for preregistered, ~ 36% for non-preregistered psychology studies) with the John, Loewenstein & Prelec (2012) prevalence estimates of questionable research practices (admission rates: 1–66% across 10 named QRPs in 2155 psychologists). Argue why the gap is consistent with a literature in which the average non-preregistered study has implicitly traversed a moderate forking path.
  • Pen-and-paper. A clinical trial of a new analgesic is preregistered on ClinicalTrials.gov with primary endpoint "VAS pain score at 4 hours" and 3 secondary endpoints. The trial reports the primary endpoint p = 0.18 (not significant), but the manuscript headlines a secondary endpoint p = 0.02 (one of the three). Read the trial against the §2.5 + §2.6 framework: (a) what was the FAMILY for the §2.5 correction? (b) is the secondary-endpoint headline a confirmatory or exploratory claim? (c) if BH FDR at q = 0.05 was the pre-specified correction, was p = 0.02 below the BH-3 threshold (3/3 · 0.05 = 0.05 for the largest)?
  • Pen-and-paper. Allen & Mehler (2019) report about 60% null-or-partial findings in published Registered Reports, vs 5–15% in conventional reports from the same fields. Assuming the underlying truth rate of null findings does not differ between submission tracks, what does this 4–12× gap imply about the operational selection bias in conventional reports? Identify the corrective mechanism the Registered Reports format adds beyond §2.6 preregistration alone.
  • Pen-and-paper. CONSORT 2010 item 6a requires "completely defined pre-specified primary and secondary outcome measures, including how and when they were assessed". STROBE item 7 requires "clearly define all outcomes, exposures, predictors, potential confounders, and effect modifiers". Map each to the corresponding step of the §2.6 9-step template. Argue why reporting guidelines like CONSORT/STROBE are upstream-of-publication checks on whether the preregistration was followed, not replacements for the preregistration itself.
  • Pen-and-paper. The "measurement schmeasurement" problem (Flake & Fried 2020, Adv Methods Pract Psychol Sci): the SAS social-anxiety scale has 24 items, multiple sub-scales, and several composite scoring algorithms. A preregistration that says "we will measure anxiety with the SAS" leaves a fork of comparable size to the forking-paths-simulator. Sketch the level of detail a preregistration WOULD have to reach to actually collapse this fork — including the exact item set, the composite formula, and any reverse-scoring decisions.
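For the clinical-trial exercise above, the Benjamini-Hochberg critical values for a family of m = 3 are (i/3) · 0.05 for ranks i = 1, 2, 3. The sketch below computes them and applies the step-up rule. The two sets of secondary-endpoint p-values are hypothetical, invented only to show that whether p = 0.02 survives depends on the rest of the family; the exercise itself does not supply the other two values.

```python
def bh_thresholds(m, q=0.05):
    """Benjamini-Hochberg critical values (i / m) * q for ranks i = 1..m."""
    return [i / m * q for i in range(1, m + 1)]

def bh_reject(p_values, q=0.05):
    """Step-up BH: reject the k smallest p-values, k = largest rank with p_(k) <= (k / m) * q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

print([round(t, 4) for t in bh_thresholds(3)])   # [0.0167, 0.0333, 0.05]

# Hypothetical families containing the reported p = 0.02:
print(bh_reject([0.02, 0.30, 0.60]))   # [False, False, False]
print(bh_reject([0.02, 0.03, 0.04]))   # [True, True, True]
```

In the first family, 0.02 is the smallest p-value and misses its rank-1 threshold of 0.0167; in the second, the larger p-values clear their thresholds and the step-up rule carries 0.02 along with them. Either way, the family has to be fixed in advance for the correction to mean anything.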

Pause and reflect: §2.6 has converted the LOOSE call of §2.4 ("pre-register the analysis plan") into a structured nine-step procedural discipline, and quantified WHY each step matters by simulating the inflation it forecloses. The forking-paths-simulator gives the visceral picture: 192 defensible analyses on the same H₀-true dataset, ~ 14% empirical Type-I under best-path selection (roughly 3× the nominal α), ~ 5% under pre-specification. The preregistration-builder gives the procedural picture: nine fields, one failure mode foreclosed per field, a YAML document ready for OSF or AsPredicted. The honest caveat: preregistration does not fix low n, bad measurement, bad questions, or substandard exploratory work — those need separate fixes. But it converts the operational Type-I rate from a free parameter into a controlled quantity, which is the single procedural keystone of confirmatory inference. §2.7 will tackle equivalence testing (TOST) — the right machinery for supporting "no meaningful effect" rather than just failing to reject H₀. §2.8 will integrate everything into the replication crisis: what happens to a literature when these procedural disciplines fail systematically across a generation.

What you now know

You can articulate the GARDEN OF FORKING PATHS (Gelman & Loken 2014): a real analysis contains many implicit defensible choices — variables, transformations, outlier rules, subgroups, model forms, error structures — and the researcher's freedom to walk any one of them inflates the operational Type-I rate, even without conscious p-hacking. You can quantify the inflation: on the widget design (5 covariates, 192 paths under H₀), the empirical best-path Type-I rate lands at ~ 12–18% — roughly 3× the nominal 5%, capped by heavy path correlation but still well above α. In real research with less-correlated paths (genuinely different outcome variables, different operationalisations of the same construct), the inflation factor can be much higher.
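
The two reference points in this paragraph can be checked with a few lines of arithmetic. The sketch below takes the independence formula 1 − (1 − α)^k as the worst-case bound, then inverts it to express an observed best-path rate as an "effective number of independent tests". That inversion is an interpretive device added here for illustration, not something the simulator reports, and the function names are hypothetical.

```python
import math


def independent_bound(alpha, k):
    """Chance of at least one false positive across k independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def effective_independent_tests(observed_rate, alpha):
    """Number of independent tests that would produce the observed best-path rate."""
    return math.log(1.0 - observed_rate) / math.log(1.0 - alpha)


if __name__ == "__main__":
    print(f"192 independent paths: {independent_bound(0.05, 192):.5f}")
    for rate in (0.12, 0.14, 0.18):
        k_eff = effective_independent_tests(rate, 0.05)
        print(f"observed {rate:.0%} -> about {k_eff:.1f} effective independent tests")
```

Read this way, the 192 correlated paths behave like roughly two and a half to four independent tests, which is why the rate sits near 3× α rather than near 100%.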

You can describe PREREGISTRATION as the procedural fix: write the analysis plan BEFORE seeing the data, time-stamp it on OSF / AsPredicted / ClinicalTrials.gov / EU CTR / PROSPERO, run the pre-specified analysis as the confirmatory test, label everything else exploratory. You can walk the 9-STEP TEMPLATE — research question, hypotheses with directionality, sample size with power justification, inclusion/exclusion, outlier handling, primary analysis, multiple-testing correction, decision rule, confirmatory vs exploratory declaration — and map each step to the specific failure mode it forecloses.
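
One way to make the nine-step template mechanical is to treat the plan as a record that cannot be frozen until every field is filled. The sketch below is an illustration under stated assumptions, not the preregistration-builder's actual schema: the field names, the `missing_fields` helper, and the SHA-256 fingerprint are all hypothetical choices. The output is JSON-quoted `key: value` lines (JSON strings are also valid YAML scalars), so only the standard library is needed.

```python
import hashlib
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Preregistration:
    """Nine fields mirroring the nine-step template (names are illustrative)."""
    research_question: str
    hypotheses: str
    sample_size: str
    inclusion_exclusion: str
    outlier_handling: str
    primary_analysis: str
    multiple_testing: str
    decision_rule: str
    confirmatory_vs_exploratory: str

    def missing_fields(self):
        """Names of fields left blank; a plan with gaps still leaves forks open."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self):
        """Render as 'key: "value"' lines, one per template step."""
        return "\n".join(
            f"{f.name}: {json.dumps(getattr(self, f.name))}" for f in fields(self)
        )

    def fingerprint(self):
        """SHA-256 of the rendered plan, so later edits to the text are detectable."""
        gaps = self.missing_fields()
        if gaps:
            raise ValueError(f"incomplete plan, missing: {gaps}")
        return hashlib.sha256(self.render().encode("utf-8")).hexdigest()


if __name__ == "__main__":
    plan = Preregistration(
        research_question="Does the intervention change the primary outcome?",
        hypotheses="H1: treated mean > control mean (directional)",
        sample_size="fixed n per arm from an a-priori power analysis; no interim looks",
        inclusion_exclusion="criteria fixed before data collection",
        outlier_handling="single rule stated in advance, applied before unblinding",
        primary_analysis="one named test on one named outcome",
        multiple_testing="correction method named for all secondary outcomes",
        decision_rule="reject H0 if p < 0.05 on the primary analysis only",
        confirmatory_vs_exploratory="everything not listed above is exploratory",
    )
    print(plan.render())
    print("sha256:", plan.fingerprint())
```

The fingerprint only shows that the text has not changed. The time stamp itself would still come from the registry (OSF, AsPredicted, and so on), which this sketch does not model.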

You can describe REGISTERED REPORTS (Chambers 2013): a publishing model where the journal accepts the paper in principle BEFORE data collection, based on the question and methods. This addresses publication bias (which preregistration alone does not) and is consistent with the ~ 60% null-finding rate reported for Registered Reports vs ~ 5–15% for conventional reports (Allen & Mehler 2019).

You can name the REPORTING GUIDELINES that complement preregistration: CONSORT 2010 (RCTs), STROBE 2007 (observational studies), and PRISMA 2020 (systematic reviews), all indexed by the EQUATOR Network. These describe what the WRITTEN-UP study should contain so reviewers can check that the published analysis matches the preregistered plan.

You can state the HONEST CAVEATS: preregistration does not fix bad research questions, low n, questionable measurement, or genuinely exploratory work. It is a partial fix, addressing one specific failure mode (analytic-path selection). The replication crisis (§2.8) is downstream of multiple simultaneous failures, of which forking paths is one.

You can connect the procedural discipline to the EMPIRICAL BASE: Nosek et al. (2018) review the impact of preregistration; John, Loewenstein & Prelec (2012) document QRP prevalence (1–66% across 10 named practices in 2155 psychologists); Munafò et al. (2017) consolidate the procedural recommendations in the Nature Human Behaviour reproducible-science manifesto. Open Science Collaboration (2015) measured the psychology reproducibility rate at ~ 36% — the empirical baseline §2.6 disciplines aim to lift.

Where this lands in Part 2. §2.7 covers EQUIVALENCE TESTING (TOST, Schuirmann 1987) — the formal machinery for supporting "no meaningful effect" when a non-significant frequentist test cannot. §2.8 integrates §2.4 (p-values), §2.5 (multiple testing), §2.6 (preregistration), and §2.7 (equivalence) into the REPLICATION CRISIS — the literature-scale consequences when these procedural disciplines fail systematically over a generation.

References

  • Gelman, A., Loken, E. (2014). "The statistical crisis in science." American Scientist 102(6), 460–465. (The canonical "garden of forking paths" paper. The argument that implicit analysis choices inflate Type-I even without conscious p-hacking. Borges-via-statistics.)
  • Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). "False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant." Psychological Science 22(11), 1359–1366. (The companion piece: simulations show that combining a handful of common researcher degrees of freedom pushes the false-positive rate above 60%, with no fabrication or fraud involved.)
  • Nosek, B.A., Ebersole, C.R., DeHaven, A.C., Mellor, D.T. (2018). "The preregistration revolution." PNAS 115(11), 2600–2606. (The state-of-the-field review of preregistration's impact, including the replication-rate gap and the OSF infrastructure.)
  • Chambers, C.D. (2013). "Registered Reports: a new publishing initiative at Cortex." Cortex 49(3), 609–610. (The original proposal of the two-stage Registered Reports format, addressing publication bias against null results.)
  • Munafò, M.R., Nosek, B.A., Bishop, D.V.M., Button, K.S., Chambers, C.D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J.J., Ioannidis, J.P.A. (2017). "A manifesto for reproducible science." Nature Human Behaviour 1(1), 0021. (The integrative manifesto: preregistration, registered reports, open data, open code, transparency, and replication as the procedural toolkit for reproducible science.)
  • Open Science Collaboration (2015). "Estimating the reproducibility of psychological science." Science 349(6251), aac4716. (The 100-replication psychology study; baseline reproducibility ~ 36%, the empirical motivation for the §2.6 disciplines.)
  • John, L.K., Loewenstein, G., Prelec, D. (2012). "Measuring the prevalence of questionable research practices with incentives for truth telling." Psychological Science 23(5), 524–532. (The 2155-psychologist anonymous survey: admission rates of 1–66% across 10 named QRPs including optional stopping, selective reporting, post-hoc subgroup analysis, and HARKing.)
  • Moher, D., Hopewell, S., Schulz, K.F., Montori, V., Gøtzsche, P.C., Devereaux, P.J., Elbourne, D., Egger, M., Altman, D.G. (2010). "CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials." BMJ 340, c869. (The canonical CONSORT 2010 reporting checklist for parallel-group RCTs. 25 items + participant-flow diagram.)
  • von Elm, E., Altman, D.G., Egger, M., Pocock, S.J., Gøtzsche, P.C., Vandenbroucke, J.P. (2007). "STROBE statement: guidelines for reporting observational studies." Epidemiology 18(6), 800–804. (The STROBE 2007 reporting checklist for cohort, case-control, and cross-sectional observational studies. 22 items.)
  • Liberati, A., Altman, D.G., Tetzlaff, J., Mulrow, C., Gøtzsche, P.C., Ioannidis, J.P.A., Clarke, M., Devereaux, P.J., Kleijnen, J., Moher, D. (2009). "The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration." PLoS Medicine 6(7), e1000100. (The PRISMA 2009 reporting checklist for systematic reviews and meta-analyses. 27 items + four-phase study-flow diagram. Updated to PRISMA 2020 by Page et al. 2021.)
  • Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H.L.J., Kievit, R.A. (2012). "An agenda for purely confirmatory research." Perspectives on Psychological Science 7(6), 632–638. (The case for the rigorous "purely confirmatory" / "purely exploratory" split, with preregistration as the gating mechanism for the former.)
  • Nosek, B.A., Alter, G., Banks, G.C., Borsboom, D., Bowman, S.D., Breckler, S.J., Buck, S., Chambers, C.D., Chin, G., Christensen, G., et al. (2015). "Promoting an open research culture." Science 348(6242), 1422–1425. (The TOP — Transparency and Openness Promotion — guidelines: an 8-standard framework for journals to incentivise pre-registration, open data, open code, and open materials. ~ 5000 journals signatory by 2024.)
  • Allen, C., Mehler, D.M.A. (2019). "Open science challenges, benefits and tips in early career and beyond." PLoS Biology 17(5), e3000246. (Reports the ~ 60% null-finding rate in published Registered Reports vs ~ 5–15% in conventional reports — the direct measurement of publication bias the format corrects.)
  • Center for Open Science (2024). Registered Reports inventory. https://www.cos.io/initiatives/registered-reports (Maintained inventory of journals offering the Registered Reports format; ~ 200 journals by 2024.)
  • Mellers, B., Hertwig, R., Kahneman, D. (2001). "Do frequency representations eliminate conjunction effects? An exercise in adversarial collaboration." Psychological Science 12(4), 269–275. (The canonical example of adversarial collaboration: two opposing teams pre-register a joint protocol, agreeing in advance on what data would settle the dispute. Rare but exemplary.)
  • Armitage, P., McPherson, C.K., Rowe, B.C. (1969). "Repeated significance tests on accumulating data." Journal of the Royal Statistical Society Series A 132(2), 235–244. (The foundational paper on the Type-I inflation from optional stopping, cited in the §2.6 sample-size template as the failure mode that pre-stating n forecloses.)
  • Wagenmakers, E.-J. (2007). "A practical solution to the pervasive problems of p values." Psychonomic Bulletin & Review 14(5), 779–804. (The optional-stopping Type-I inflation quantified for typical NHST psychology designs; canonical reference for "peeking at the data" inflation.)
  • Flake, J.K., Fried, E.I. (2020). "Measurement schmeasurement: questionable measurement practices and how to avoid them." Advances in Methods and Practices in Psychological Science 3(4), 456–465. (The "measurement schmeasurement" critique: preregistration of the analysis path does not foreclose measurement-operationalisation flexibility unless the exact scoring rule is also pre-specified.)
  • Page, M.J., McKenzie, J.E., Bossuyt, P.M., Boutron, I., Hoffmann, T.C., Mulrow, C.D., Shamseer, L., Tetzlaff, J.M., Akl, E.A., Brennan, S.E., et al. (2021). "The PRISMA 2020 statement: an updated guideline for reporting systematic reviews." BMJ 372, n71. (PRISMA 2020 update: modernised search-strategy documentation, bias-assessment requirements, and reporting of automation tools used in systematic reviews.)
  • Goodman, S.N. (2008). "A dirty dozen: twelve P-value misconceptions." Seminars in Hematology 45(3), 135–140. (Twelve named misinterpretations of the p-value; the §2.4 reference that motivates the §2.6 procedural cleanup.)
