The replication crisis and what to actually do
Learning objectives
- Recall the EMPIRICAL findings of the major replication projects: Open Science Collaboration (2015, Science), 36% of 100 psychology effects reached statistical significance with the same direction in the replication; Camerer et al. (2018, Nature Human Behaviour), 13 of 21 social-science experiments (62%) replicated; Klein et al. Many Labs 2 (2018, AMPPS), 14/28 effects (50%) reached the conventional cutoff; Klein et al. Many Labs (2014, Soc Psychol), 10/13 robust
- State the FOUR mechanical contributors that compound into the literature-level inflation: (i) publication bias (only 'significant' published), (ii) p-hacking from §2.4 (optional stopping, forking paths, selective reporting), (iii) multiple-testing flexibility from §2.5, (iv) low power from §2.2 — and recognise these are the §2.4–§2.7 failures, just summed over a generation
- Define TYPE-M ERROR (magnitude error, Gelman & Carlin 2014, Perspect Psychol Sci): in under-powered designs, even genuinely-true effects, conditional on having reached p < 0.05, are SYSTEMATICALLY over-estimated — the published effect size is the true effect size multiplied by an inflation factor that grows as power drops
- Define TYPE-S ERROR (sign error, Gelman & Carlin 2014): in under-powered designs, conditional on having reached p < 0.05, the probability that the published effect has the WRONG SIGN is non-negligible (can exceed 5% at power < 0.10). The literature carries not just inflated magnitudes but occasional inverted directions
- Quote Ioannidis (2005, PLoS Med) — 'Why most published research findings are false' — and reproduce the key argument: under realistic priors on the proportion of true hypotheses, plus power and bias, the positive predictive value of a 'significant' finding can be well below 50%
- Cite Begley & Ellis (2012, Nature) — the Amgen replication audit: 47 of 53 landmark preclinical cancer findings did NOT replicate (11% replication rate). The crisis is not psychology-specific; biomedicine has its own version
- State the SIX principles of the Manifesto for Reproducible Science (Munafò et al. 2017, Nature Human Behaviour): (1) protect against cognitive biases via blinding and preregistration, (2) improve methodology with larger samples and better design, (3) improve reporting via open data and code, (4) reproducibility and replication TRAINING, (5) diversify peer review, (6) reward open and reproducible practices in hiring/promotion
- List the PRACTICAL agenda — what a researcher can do TODAY: preregister using the AsPredicted 8-question or OSF template (Simonsohn et al. 2014; Nosek et al. 2018), compute power BEFORE collecting data, share data and code on OSF or GitHub, report effect sizes WITH CIs (not just p-values), avoid the p < 0.05 dichotomy (Wasserstein, Schirm, Lazar 2019, Am Statistician), distinguish confirmatory from exploratory, consider Bayesian alternatives or equivalence testing where appropriate (§2.7)
- Recognise PUBLICATION BIAS as a literature-level phenomenon: funnel-plot asymmetry (Egger et al. 1997, BMJ) is the canonical diagnostic; the bottom-left corner of the funnel (small studies with null or negative effects) empties out, leaving the meta-analytic mean systematically inflated above the true mean
- State the HONEST CAVEATS: methodological reform takes a generation because the incentive structure (publication, funding, promotion) still rewards 'significant' findings; some fields (psychology after 2015) are further along than others; methodological reform alone does NOT fix bad RQs or bad measurement
- Recognise the limits of preregistration: it disciplines THE ANALYSIS but does not improve the data quality, the measurement validity, or the question's importance. Preregistration is necessary; not sufficient
- Connect §2.8 backwards: §2.4 p-values explain WHY a single study can mis-state; §2.5 multiple-testing explains WHY a multi-outcome study can mis-state; §2.6 preregistration explains the FIX at the study level; §2.7 equivalence explains the FIX at the inference level; §2.8 shows what happens when those fixes are systematically absent at the literature level
- Preview Parts 3 + 7 + 8: Part 3 (CIs) is the replacement for the p < 0.05 dichotomy; Part 7 (Bayesian) is the alternative paradigm Wasserstein et al. (2019) and Kruschke (2018) endorse for evidence quantification; Part 8 (resampling) provides bootstrap CIs as more robust alternatives to parametric inference
§2.4 through §2.7 have built every procedural fix the literature has assembled for the integrity problems of the last half-century: §2.4 explained what a p-value is and the mechanical ways flexibility in the data-analysis path inflates the false-positive rate (optional stopping, garden of forking paths, selective reporting); §2.5 controlled for the multiple-testing inflation via FWER (Bonferroni, Holm) and FDR (Benjamini–Hochberg); §2.6 imposed pre-data discipline on the analysis plan through preregistration; §2.7 supplied the machinery for supporting the null through equivalence testing and TOST. Each section is local in scope — a single study, a single analysis, a single decision.
§2.8 zooms out. It asks what happens when those procedural disciplines are SYSTEMATICALLY ABSENT across a body of literature spanning a generation of researchers. The answer is the REPLICATION CRISIS: when major replication projects in psychology (Open Science Collaboration 2015, Science), experimental social science (Camerer et al. 2018, Nature Human Behaviour), and biomedicine (Begley & Ellis 2012, Nature) audited the published literature, between 38% and 89% of findings failed to replicate (replication rates from 62% down to 11%). This is not noise; it is the predictable downstream consequence of small samples, flexible analyses, and a publication system that filters out null results. §2.8 names the mechanism, quantifies it, and lays out the practical research-reform agenda that follows.
The arc has six stops. First, the EMPIRICAL evidence: the four large-scale replication projects and what they found. Second, the MECHANICS of inflation: how §2.4 + §2.5 + §2.2 + publication bias compound into Type-M and Type-S errors at the literature level (Gelman & Carlin 2014, Perspectives on Psychological Science). Third, the FUNNEL-PLOT widget: a simulation of how a tunable publication filter empties the small-study corner and inflates the meta-analytic mean. Fourth, the Manifesto for Reproducible Science (Munafò et al. 2017, Nature Human Behaviour) and its six principles. Fifth, the CHECKLIST widget: a 20-item operationalisation of the practical agenda, with the explicit failure mode each item forecloses. Sixth, the HONEST CAVEATS — what reform can and cannot deliver — and the connections forward to Parts 3, 7, and 8 where the alternative inferential paradigms live.
The empirical evidence: how often does the literature replicate?
By the early 2010s, scattered evidence had begun to accumulate that the published rate of "significant" findings was inconsistent with the underlying base rate of real effects (Pashler & Wagenmakers 2012, Perspect Psychol Sci). The decisive empirical test came in the form of large-scale, pre-registered REPLICATION PROJECTS — independent teams attempting to reproduce previously-published findings using protocols identical to (or stronger than) the originals.
- Open Science Collaboration (2015, Science). One hundred experimental and correlational studies from three leading psychology journals (Psychological Science, JPSP, JEP:LMC) were replicated by 270 contributing authors across 64 sites. Of the 100 original studies, 97 had reported p < 0.05; the replications reproduced statistical significance in the same direction in 36% of those cases (35/97). Mean effect sizes in the replications were about half those of the originals (r ≈ 0.20 vs r ≈ 0.40). This is the canonical paper of the replication crisis. The full database, including all replication protocols and analysis scripts, is open on OSF.
- Camerer et al. (2018, Nature Human Behaviour). 21 experimental social-science papers published in Nature and Science between 2010 and 2015. With pre-registered replications at average n ≈ 5× the original n, 13 of 21 (62%) reached statistical significance in the same direction — better than the OSC 2015 rate but still well below 100%. Mean replicated effect sizes were 50% of the originals.
- Klein et al. Many Labs 2 (2018, Advances in Methods and Practices in Psychological Science). 28 well-known psychology effects, replicated by 125 independent samples across 36 countries (n_total ≈ 15,000+). The point of Many Labs 2 was to control for heterogeneity-across-samples explanations: even with this design, only 14 of 28 effects (50%) showed evidence of the original effect at conventional significance levels.
- Klein et al. Many Labs 1 (2014, Social Psychology). 13 classical psychology effects across 36 samples, totalling n ≈ 6,300. Ten of 13 replicated robustly; three (currency priming, flag priming, imagined contact) did not. This was the warm-up for Many Labs 2 and an early signal that replicability varies sharply by effect type.
- Begley & Ellis (2012, Nature). An audit of 53 landmark pre-clinical cancer findings by Amgen scientists found only 6 (11%) were independently reproducible. The crisis is not confined to psychology; biomedicine has its own, with similar mechanics (small samples, flexible analyses, publication bias).
The four numbers — 36%, 62%, 50%, 11% — depend on field, design, definition of "replication", and the original effect distribution. They are NOT a unified replication rate. But they are uniformly far below the ~ 85–90% one would expect if published p < 0.05 findings reflected real effects of the reported magnitude. That gap is the replication crisis, quantified.
Why this happens — the synthesis of §2.4–§2.7
The replication-crisis numbers are not a mystery once §2.4–§2.7 are in hand. Four mechanisms compound; each is named in an earlier section.
- Publication bias. The strongest single contributor. Studies that reach "p < 0.05" are several times more likely to be submitted, accepted, and cited than studies that fail to (Franco, Malhotra, Simonovits 2014, Science). The published literature is a filtered subset of the conducted literature, biased TOWARD positive findings. If 100 studies of a truly null effect are run at α = 0.05, about 5 reach significance by chance alone; the literature publishes those 5 and discards the 95 that failed.
- p-hacking (§2.4). The mechanical inflation operating WITHIN a single published study: optional stopping (Armitage et al. 1969, JRSS-A), the garden of forking paths (Gelman & Loken 2014, Am Sci), selective endpoint reporting (Chan et al. 2004, JAMA). No single fork needs to be large: combined, a handful of them can inflate the per-study Type-I rate from the nominal 5% to as much as ~ 60% (Simmons, Nelson, Simonsohn 2011, Psychol Sci). The garden-of-forking-paths inflation is invisible to the author because it accrues across decisions the author never consciously aggregated.
- Multiple-testing flexibility (§2.5). Without preregistered correction, k independent tests at α = 0.05 inflate the FWER to 1 − (1 − 0.05)^k. For k = 20 (entirely realistic in a paper with 10 outcomes × 2 subgroups) that is 1 − 0.95^20 ≈ 64%. The BH FDR alternative addresses this if pre-specified, but most papers do not pre-specify any correction.
- Low power (§2.2) and the Type-M / Type-S errors. The least-intuitive of the four. In an under-powered study (power < 0.50), the studies that reach significance are precisely those that happen to over-estimate the true effect — by sampling chance. Gelman and Carlin (2014, Perspectives on Psychological Science) call this the magnitude error or Type-M error. Under a true d = 0.20 and n = 20, the inflation factor conditional on p < 0.05 is roughly 3× — the literature reports d ≈ 0.60 even when the truth is d = 0.20. Even worse, the sign error (Type-S) — the probability that the published estimate has the wrong sign — can exceed 5% in very-low-power designs.
Combined: a literature dominated by under-powered studies, with selective reporting of significant results, with flexible analysis paths whose forks are never disclosed, will SYSTEMATICALLY report effect sizes that are inflated, sometimes inverted, and unreliable when independent replicators try to find them. The replication-rate numbers in the previous subsection are exactly the predictable downstream consequence.
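The FWER and Type-M / Type-S figures above can be checked numerically. The sketch below is a minimal Python illustration, not Gelman and Carlin's own code: it assumes the estimate is normally distributed around a positive true effect with a known standard error, and the function names (fwer, retrodesign_normal) are hypothetical. The example call further assumes a one-sample or paired design with n = 20, so that SE = 1/√20; a two-group design with 20 per group has a larger SE and therefore a larger inflation factor.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def fwer(k: int, alpha: float = 0.05) -> float:
    """Family-wise error rate for k independent tests, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** k


def retrodesign_normal(true_effect: float, se: float, alpha: float = 0.05):
    """Return (power, Type-S rate, Type-M ratio) under a normal approximation.

    Assumptions (hypothetical helper, for illustration only):
    - the estimate is N(true_effect, se^2) with true_effect > 0;
    - 'significant' means |estimate / se| > z_crit, two-sided.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    mu = true_effect / se                          # true effect in SE units
    p_right = 1.0 - STD_NORMAL.cdf(z_crit - mu)    # significant, correct sign
    p_wrong = STD_NORMAL.cdf(-z_crit - mu)         # significant, wrong sign
    power = p_right + p_wrong
    type_s = p_wrong / power
    # Truncated-normal means:
    #   E[ z ; z >  c] =  mu * P(z >  c) + pdf(c - mu)
    #   E[-z ; z < -c] = -mu * P(z < -c) + pdf(c + mu)
    mean_abs_z = (mu * p_right + STD_NORMAL.pdf(z_crit - mu)
                  - mu * p_wrong + STD_NORMAL.pdf(z_crit + mu)) / power
    type_m = mean_abs_z / mu                       # E[|estimate| | significant] / true effect
    return power, type_s, type_m


print(f"FWER for k = 20: {fwer(20):.3f}")
power, type_s, type_m = retrodesign_normal(true_effect=0.20, se=1 / sqrt(20))
print(f"power = {power:.3f}, Type-S = {type_s:.3f}, Type-M = {type_m:.2f}")
```

Under these assumptions the power comes out far below 0.50 and the exaggeration ratio lands near the "roughly 3×" quoted above. Shrinking the sample (a larger se argument) pushes both the Type-M ratio and the Type-S rate upward, which is the qualitative point of the low-power bullet.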
Ioannidis (2005, PLoS Medicine) — "Why most published research findings are false" — gave the formal account. Under realistic priors on the proportion of true hypotheses in a field (the "pre-study odds"), realistic power, and realistic bias, the POSITIVE PREDICTIVE VALUE of a "significant" finding (the conditional probability that a p < 0.05 finding reflects a real effect) can fall well below 50%. The paper has been cited over 12,000 times and is the most-read article ever published in PLoS Medicine.
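The positive-predictive-value argument reduces to a one-line calculation. The sketch below assumes the simplest version of the formula, written in terms of the pre-study probability that a tested hypothesis is true and with the bias term of Ioannidis (2005) left out; the function name ppv and the example values are illustrative choices, not figures from the paper.

```python
def ppv(prior: float, power: float, alpha: float = 0.05) -> float:
    """Positive predictive value of a 'significant' finding.

    prior : pre-study probability that a tested hypothesis is true
    power : P(p < alpha | hypothesis true)
    alpha : P(p < alpha | hypothesis false)
    The bias term of Ioannidis (2005) is deliberately omitted in this sketch.
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative (made-up) field profiles: a confirmatory field vs an exploratory one.
for prior in (0.50, 0.10):
    for power in (0.20, 0.35):
        print(f"prior = {prior:.2f}, power = {power:.2f} -> PPV = {ppv(prior, power):.2f}")
```

With a prior of 0.10 and power of 0.20, fewer than one in three significant findings is true under this formula, even before any bias is added. The same power at a prior of 0.50 gives a PPV of 0.80, which is why the argument depends so heavily on how speculative a field's hypotheses are.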
Watching publication bias build the funnel — the publication-bias-funnel widget
The most visible literature-level signature of publication bias is FUNNEL-PLOT ASYMMETRY (Egger et al. 1997, BMJ). A funnel plot is a scatter of (observed effect size, precision = 1/SE) across the studies in a research area. Under no publication bias, the cloud is symmetric about the true effect: large-precision (high-n) studies cluster tightly around the truth at the top of the funnel; low-precision (low-n) studies fan out symmetrically below, forming the funnel's flared base. Under publication bias, the LOWER-LEFT corner of the fan — small studies with null or negative effects, exactly the studies that fail to reach significance — empties out, and the cloud becomes asymmetric.
The first widget makes the funnel-building process interactive. Pick a "field profile" (the true effect distribution: null, small, medium, mixed), then slide the publication-bias strength from 0% (every conducted study published) to 100% (only significant positive results published). The widget simulates a body of studies under that profile and applies the filter, then plots the surviving studies on the funnel.
Things to verify in the widget:
- Start at the "Small effect (true d ≈ 0.2)" profile with publication-bias = 0%. The funnel is symmetric about the dashed blue true-mean line. The published-effect mean tracks the true 0.20 to within Monte Carlo noise. The grey curves are the |z| = 1.96 significance envelope: studies outside the envelope have p < 0.05 (the standard cutoff). At low precision (bottom of the funnel), the envelope is wide because the SE is large; at high precision (top), it narrows.
- Slide publication-bias up to 50%. The funnel's lower-left corner empties out — small studies with null or negative effects are now censored (grey × markers). The published-effect mean (top of the numeric panel) drifts UPWARD from 0.20 toward ~ 0.30. This is the publication-bias inflation operating in real time: the published literature now over-states the true effect by ~ 50%.
- Slide publication-bias to 100%. Only significant + positive studies remain. The funnel's entire lower-left half is gone; the published-mean climbs toward ~ 0.50 or higher. The inverse-variance-pooled mean (which a meta-analysis would report) climbs with it. The literature, treated as a meta-analytic input, would now over-state the effect by 2–3×. The Egger funnel-asymmetry test would detect this readily; an unwary meta-analyst who didn't check would simply inherit the bias.
- Switch to the "Null field (true d ≈ 0)" profile with publication-bias = 100%. The truth is no effect, but the surviving published studies all sit OUTSIDE the |z| = 1.96 envelope on the positive side. The published mean is well above zero. A casual reader, faced with this literature, would conclude there IS an effect — when in fact every published finding is a sampling-noise positive. This is the Ioannidis (2005) scenario in graphical form: in a field with weak true effects, low power, and strong publication bias, the published mean can be entirely manufactured by the filter.
- Vary the "Mean per-study n" radio between 20, 40, 80, 160. As n grows, individual studies are more precise, fewer fail to reach significance under the true effect, and the inflation drops — even under strong publication bias. This is the SAMPLE-SIZE side of the reform agenda: larger studies inherit less bias because more of them survive the filter on merit, not on noise.
- Re-roll the seed at a fixed setting a few times. The empirical published-mean fluctuates but the systematic bias DIRECTION is invariant: publication bias makes the literature over-state, never under-state, the truth (assuming the bias favours positive findings, which it almost always does).
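The behaviour described in the list above can be reproduced offline with a short simulation. This is a sketch under stated assumptions, not the widget's source code: the sample-size distribution, the definition of a "welcome" study (significant and positive), the censoring rule, and all function names are choices made for illustration. The Egger intercept is computed as the ordinary least-squares intercept of effect/SE regressed on 1/SE.

```python
import numpy as np


def simulate_literature(true_d=0.2, mean_n=40, n_studies=2000, bias=0.0, seed=0):
    """Simulate two-group studies, then censor a share of the 'unwelcome' ones.

    Assumptions of this sketch (not taken from the widget):
    - per-group n is uniform on [mean_n/2, 3*mean_n/2];
    - the observed d is N(true_d, SE^2) with SE = sqrt(2 / n);
    - a study is 'welcome' if it is significant AND positive (d / SE > 1.96);
    - each unwelcome study is dropped with probability `bias`.
    """
    rng = np.random.default_rng(seed)
    n = rng.integers(mean_n // 2, 3 * mean_n // 2 + 1, size=n_studies)
    se = np.sqrt(2.0 / n)
    d_obs = rng.normal(true_d, se)
    welcome = d_obs / se > 1.96
    published = welcome | (rng.random(n_studies) >= bias)
    return d_obs[published], se[published]


def pooled_mean(d, se):
    """Inverse-variance-weighted (fixed-effect) mean of the published effects."""
    w = 1.0 / se**2
    return float(np.sum(w * d) / np.sum(w))


def egger_intercept(d, se):
    """Intercept of the regression of d/SE on 1/SE; near 0 for a symmetric funnel."""
    slope, intercept = np.polyfit(1.0 / se, d / se, deg=1)
    return float(intercept)


for bias in (0.0, 0.5, 1.0):
    d, se = simulate_literature(bias=bias)
    print(f"bias = {bias:.1f}: published = {d.size:4d}, mean = {d.mean():.2f}, "
          f"pooled = {pooled_mean(d, se):.2f}, Egger intercept = {egger_intercept(d, se):+.2f}")
```

With the filter off, the published mean should sit near the true 0.20 and the Egger intercept near zero. As bias rises, the published and pooled means move upward and the intercept turns clearly positive. Calling simulate_literature with true_d=0.0 and bias=1.0 gives the "Null field" scenario, where every surviving study is a sampling-noise positive.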
The Manifesto for Reproducible Science
The institutional response to the replication crisis crystallised in Munafò et al. (2017, Nature Human Behaviour) — the Manifesto for Reproducible Science, a 30-author consensus statement covering psychology, biomedicine, and the broader life sciences. The Manifesto lays out six principles, each a counter to a specific failure mode in §2.4–§2.7.
- Protect against cognitive biases. Use blinding (analyst-blind, double-blind) and PREREGISTRATION (§2.6) so the analyst's incentives don't bend the analysis path. This is the §2.6 garden-of-forking-paths fix, applied at scale.
- Improve methodology. Larger samples (power ≥ 0.80 at the smallest effect of interest, ideally 0.90), better measurement validity, registered replications. This is the §2.2 power discipline plus the §2.7 SESOI discipline applied at the design stage.
- Improve reporting and dissemination. Open data, open code, open methods. Tools: OSF (osf.io), Zenodo (zenodo.org), GitHub. Without the raw data and the code, replication is logically impossible and "replication failure" cannot be cleanly attributed to a flaw in the original or the replicator.
- Reproducibility and replication training. Make the §2.4–§2.7 procedural disciplines part of the graduate curriculum — Type-M error, optional stopping, FDR, preregistration. This is what every part of this textbook contributes toward.
- Diversify peer review. Move from single-shot pre-publication review toward continuous post-publication review (PubPeer, comment threads on the article landing page), and Registered Reports — where the methods and analysis plan are reviewed BEFORE data collection and the paper is conditionally accepted regardless of the outcome (Chambers 2013, Cortex).
- Reward open and reproducible practices. Hiring, tenure, and funding decisions must START to weigh reproducibility — preregistrations, open data, replications — alongside publication count and citation metrics. This is the slow institutional change Munafò et al. flag as the binding constraint.
The Manifesto is now the touchstone consensus document for reform. It does not pretend the agenda is complete or that adoption is rapid. It does explicitly link each principle to a specific failure mode in the current literature, which is what makes it usable as a checklist.
What an individual researcher can do TODAY
The Manifesto's six principles target the institutional level. The corresponding individual-researcher agenda is short and actionable, and every item maps onto a tool that already exists in 2026.
- Preregister, before any data collection. Use the AsPredicted (aspredicted.org) 8-question template for the lightest entry point (Simonsohn et al. 2014, SSRN) or the longer OSF preregistration template (Nosek et al. 2018, PNAS). The §2.6 widget is a working draft tool for this.
- Compute power BEFORE collecting data. Pick the smallest effect that would matter (the SESOI from §2.7). Compute n needed for power ≥ 0.80 (or 0.90) against that effect. If the budget can't support the required n, the choice is not "run the underpowered study and hope" — it is "find collaborators / use a larger sample / consider a sequential design / abandon this design".
- Share data and code. OSF for both, with a DOI. GitHub for code with a tagged commit matching the submission. Include the environment files (requirements.txt, renv.lock, Dockerfile) so the numbers are reproducible without manual dependency archaeology. Wilson et al. (2017, PLOS Comp Bio) make the case for code as a first-class output of research.
- Report effect sizes WITH CIs, not just p-values. A p-value alone cannot tell the reader whether the effect is practically meaningful or a tiny artefact. An effect size with a 95% CI does both jobs. Wasserstein, Schirm, Lazar (2019, American Statistician) — the editorial that introduced the special issue "Moving to a world beyond p < 0.05" — make this the first recommendation.
- Do not dichotomise at p = 0.05. Report the exact p, interpret with the effect size. Wasserstein et al. (2019) describe the dichotomy as the single biggest driver of selective reporting in the modern literature. The American Statistical Association's 2016 statement on p-values (Wasserstein & Lazar 2016, Am Stat) had already called the cliff "arbitrary" three years earlier.
- Distinguish CONFIRMATORY from EXPLORATORY analyses. Label each in the paper. The preregistration locks the confirmatory; everything else is exploratory and must be reported as such. This single discipline forecloses the "I found something interesting and wrote up the paper around it" failure mode (Nosek et al. 2018).
- Consider Bayesian or equivalence-test alternatives. Bayesian inference (Part 7) gives posterior probabilities — direct statements about parameters — that side-step the p-value misinterpretation problem entirely. Equivalence testing (§2.7) is the correct tool when the research question is "is the effect small enough to ignore?". Kruschke (2018, AMPPS) and Wagenmakers et al. (2018, Psychon Bull Rev) lay out the Bayesian routes; Lakens (2017, Soc Psychol Pers Sci) lays out the equivalence-test route.
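Two items on this agenda, computing power before collecting data and reporting an effect size with its CI, can be sketched with the standard library alone. Both functions use large-sample normal approximations (an exact t-based power calculation gives a slightly larger n, and the CI is crude for very small groups); the function names and the toy data are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev

Z = NormalDist()  # standard normal


def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2,
    where d is the smallest standardised effect of interest (the SESOI).
    """
    z_alpha = Z.inv_cdf(1.0 - alpha / 2.0)
    z_power = Z.inv_cdf(power)
    return ceil(2.0 * (z_alpha + z_power) ** 2 / d ** 2)


def cohens_d_with_ci(group_a, group_b, level: float = 0.95):
    """Cohen's d (pooled SD) with a large-sample normal-approximation CI."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    d = (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
    # Common large-sample approximation to the standard error of d.
    se = sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
    z = Z.inv_cdf(0.5 + level / 2.0)
    return d, d - z * se, d + z * se


for sesoi in (0.2, 0.5):
    print(f"SESOI d = {sesoi}: n per group = {n_per_group(sesoi)} (power 0.80)")

# Toy data, invented purely to exercise the function.
treatment = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]
d, lo, hi = cohens_d_with_ci(treatment, control)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The toy example shows why the CI matters: the point estimate is large, yet with five observations per group the interval is wide enough to include zero. The sample-size function makes the budget conversation concrete, since a SESOI of d = 0.2 needs several hundred participants per group at power 0.80.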
The practical agenda, made operational — the research-reform-checklist
The Manifesto + the individual-researcher agenda together resolve into a 20-item checklist across 5 categories: preregistration, sample size and power, analysis plan, data and code sharing, and reporting standards. The second widget is that checklist. Each item is a yes/no question; each has a "why?" tooltip naming the specific failure mode it forecloses. A score updates live, with per-category breakdown — so the reader can see WHICH category is the weakest link in their own planned study.
Things to verify in the widget:
- Click "Mark all yes" once. The overall score reaches 100% and the verdict turns green. This is the IDEAL study profile, with every procedural discipline in place. The items that are "load-bearing" (preregistration published before data collection, primary analysis identified, power analysis done, data and code shared, effect sizes reported with CIs) are flagged with a yellow "load-bearing" badge. Those items count double in the score, because they are the disciplines whose absence drives the bulk of the replication-crisis inflation.
- Click "Clear all", then "Mark all yes" again. Now uncheck the preregistration items but leave the rest checked. The "1. Preregistration" category turns red, the overall score drops, and the recommendation list at the bottom flags preregistration as the weakest link. This is the typical-study-circa-2015 profile.
- Set power to "no" on all items, leave preregistration and sharing at "yes". The "2. Sample size and power" category turns red. The recommendation list flags it. This is the under-powered-but-otherwise-careful study — common in budget-constrained labs.
- Click "why?" on a few items. The tooltip names the specific failure mode the item forecloses, with the §2.4–§2.7 reference: e.g., the hard-stop-on-n item cites Armitage et al. (1969) on optional stopping; the dichotomy item cites Wasserstein et al. (2019); the equivalence item cites Altman & Bland (1995). The checklist is not arbitrary — every item maps onto a specific procedural fix, named in an earlier section.
- For your own next planned study, run the checklist before any data collection. The categories below 50% are the gaps to close FIRST. Closing a gap means changing the design or the protocol, not changing the analysis after the fact.
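The scoring rule described above (yes/no items, load-bearing items counted double, a per-category breakdown) can be sketched in a few lines. The ChecklistItem class, the function names, and the four item texts below are hypothetical stand-ins, not the widget's actual items or source; only the double weighting is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    category: str
    text: str
    load_bearing: bool = False   # load-bearing items count double
    answer: bool = False         # True means "yes"

    @property
    def weight(self) -> int:
        return 2 if self.load_bearing else 1


def score(items) -> float:
    """Weighted share of 'yes' answers, as a percentage."""
    total = sum(item.weight for item in items)
    earned = sum(item.weight for item in items if item.answer)
    return 100.0 * earned / total


def score_by_category(items) -> dict:
    """Per-category scores, in order of first appearance."""
    groups = {}
    for item in items:
        groups.setdefault(item.category, []).append(item)
    return {category: score(members) for category, members in groups.items()}


# Hypothetical mini-checklist: a powered study that was never preregistered.
items = [
    ChecklistItem("Preregistration", "Preregistration published before data collection", True, False),
    ChecklistItem("Preregistration", "Exclusion rules fixed in advance", False, True),
    ChecklistItem("Sample size and power", "Power analysis done", True, True),
    ChecklistItem("Sample size and power", "Hard stop on n specified", False, True),
]
per_category = score_by_category(items)
weakest = min(per_category, key=per_category.get)
print(f"overall = {score(items):.0f}%, weakest category = {weakest}")
```

Because the missing item is load-bearing, it costs two of the six available points rather than one of four, which is the sense in which such items dominate the overall score.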
Honest caveats — what reform cannot do
The procedural-reform agenda has limits, and an honest account of §2.8 has to name them.
- The incentive structure changes slowly. Publication-bias filters operate at the journal level, the citation level, and the hiring-and-tenure level. Even a researcher who runs a perfectly preregistered, well-powered, openly-shared study still competes for journal space against under-powered "p < 0.05" studies that report inflated effects. As of 2026, Registered Reports are still a small minority of published research. The incentives are slow to bend.
- Some fields are further along than others. Psychology, after the 2015 OSC paper, became the testbed for reform. Many psychology journals now require preregistration; OSF preregistrations are routine for graduate work. Economics and biomedicine have lagged, for field-specific reasons: biomedicine's preclinical research uses small-n animal cohorts where standard power calculations are awkward, and economics field experiments have ethical and budget constraints that complicate replication.
- Methodological reform does NOT fix bad questions. If a study asks an uninteresting question, a perfectly-conducted analysis of a perfect dataset still produces an uninteresting answer. The procedural disciplines protect against fabricated-by-noise positives; they cannot identify which questions are worth asking.
- Methodological reform does NOT fix bad measurement. If the dependent variable is a noisy proxy for what the researcher cares about, preregistration locks the analysis but cannot transform the measure into something more informative. Construct validity (Cronbach & Meehl 1955, Psychol Bull) is a separate, prior, discipline.
- Replication failure has multiple causes. A failed replication can mean: the original was a false positive (the canonical case), the replication was under-powered, the effect is real but moderated by a context the original held fixed and the replicator did not (the "hidden moderator" hypothesis), or the replication itself contained an error. Disentangling these is non-trivial; Many Labs 2 (Klein et al. 2018) was specifically designed to address the context-hidden-moderator hypothesis at scale.
- Preregistration is necessary, not sufficient. A study can be preregistered and still ask a bad question, use a noisy measure, or contain an honest analysis mistake. Preregistration constrains the analysis path; it does not validate the study.
Try it
- In the publication-bias-funnel, set the field profile to "Small effect (true d ≈ 0.2)" and publication-bias to 0%. Note the published-mean ≈ 0.20. Now slide bias to 1.0 (100%). Read the new published-mean. Compute the inflation factor (published / true). Argue why this is the literature-level Type-M error, distinct from the per-study Type-M error in Gelman & Carlin (2014).
- Same widget. Set the field profile to "Null field (d ≈ 0)" and publication-bias to 1.0. Click Re-roll a few times. Note the published mean — non-zero, sometimes substantially so. Argue from this: in a field with no real effect but strong publication bias, the literature can manufacture an apparent effect entirely from the filter. Relate this to Ioannidis (2005) and the positive predictive value of a "significant" finding.
- Same widget. At the "Small" profile and bias = 0.5, switch the mean-n between 20 and 160. Note how the published-mean changes. Argue why larger n attenuates publication-bias inflation: at large n, almost all studies reach significance under the true effect, so the filter discards proportionally fewer studies on the basis of luck.
- In the research-reform-checklist, click "Mark all yes" and read the score. Now uncheck all items in category 2 (Power). Read the recommendation list at the bottom. Now also uncheck preregistration and sharing items. Note which category drops the score the most — the load-bearing items (yellow badge) carry double weight, exactly because their absence drives the bulk of the literature inflation.
- Same widget. Take a published paper you know well. Run through the 20 items as if you were planning that study. How many items would you have to check "no" on? Which category is the weakest? What single procedural change would have the biggest effect on the study's reproducibility?
- Pen-and-paper. Gelman & Carlin (2014) derive the Type-M inflation factor as approximately E[|d̂| | p < 0.05] / |d_true|. For a true d = 0.20 with per-group n = 25 (so SE ≈ √(2/25) ≈ 0.28), the z = d/SE ≈ 0.71. Argue: at this z, the power at α = 0.05 (two-sided) is about 0.11. Conditional on |z_obs| > 1.96 — i.e., on reaching significance — what is the expected magnitude of d̂ ? Show that the inflation factor is greater than 2. (Hint: use the truncated-normal mean formula.)
- Pen-and-paper. Open Science Collaboration (2015) reports that the replication effect sizes were on average 50% of the original effect sizes (r-replication / r-original ≈ 0.50). Argue from §2.8 mechanics: how much of that 0.50 ratio is attributable to Type-M error (regression to the truth as power increases) vs publication bias (the original was filter-selected) vs both? Cite Camerer et al. (2018) for the analogous 50% figure in experimental social science.
- Pen-and-paper. Ioannidis (2005, PLoS Med) gives a formula for positive predictive value. Written in terms of the pre-study probability π that the hypothesis is true, it is PPV = (power × π) / (power × π + α × (1 − π)); Ioannidis states the same result in terms of the pre-study odds R = π / (1 − π). For α = 0.05, power = 0.50, and π = 0.10 (one true hypothesis in 10), compute PPV. Now compute PPV at power = 0.80. Argue why low power not only causes false negatives but also DECREASES the trustworthiness of positive findings.
- Pen-and-paper. Egger et al. (1997, BMJ) propose the funnel-asymmetry test: regress the standardised effect (effect / SE) on precision (1/SE); a non-zero intercept indicates publication bias. For a literature with no bias, the intercept should be ≈ 0. For the §2.8 widget at strong bias, sketch why the intercept would be positive: the small-study (low-precision) cloud is pushed up by the filter, while the high-precision studies are unfiltered. The regression catches the asymmetry.
- Pen-and-paper. The American Statistical Association (Wasserstein & Lazar 2016) and Wasserstein, Schirm, Lazar (2019) recommend moving beyond the p < 0.05 cliff. List three specific reporting practices that operationalise this recommendation: (a) exact p-values, (b) effect sizes with CIs, (c) interpretive language ("evidence is consistent with", "the data suggest") rather than dichotomous ("significant"). Cite a paper you have read that does each well or poorly.
- Pen-and-paper. Argue why methodological reform alone cannot fix BAD MEASUREMENT. Take an example: a paper measures "well-being" via a 3-item Likert scale with Cronbach's α = 0.50. The paper preregisters the analysis, powers the study at 0.90, shares data and code, and reports effect sizes with CIs. The analysis is procedurally exemplary. Argue why a replication failure of this study would still leave the conclusion uncertain: what construct-validity (Cronbach & Meehl 1955) questions remain unanswered?
- Pen-and-paper. The Begley & Ellis (2012) Amgen audit found 11% reproducibility in preclinical cancer research. List three structural features of preclinical biomedicine (candidates include small animal cohorts, limited blinding, frequent ad-hoc subgroup analyses, and lack of a preregistration culture) that make the per-study Type-M error larger than in psychology. Argue why a Manifesto-style reform agenda in biomedicine would need to address these field-specific features, not just import the psychology playbook.
Pause and reflect: §2.8 has integrated §2.4 (p-values), §2.5 (multiple testing), §2.6 (preregistration), and §2.7 (equivalence) into a literature-level account. The empirical evidence (Open Science Collaboration 2015; Camerer et al. 2018; Many Labs 2, Klein et al. 2018; Begley & Ellis 2012) gives replication rates in the 11%–62% range across psychology, experimental social science, and biomedicine. The mechanism is the compound of publication bias, p-hacking, multiple testing, and low power, with Type-M and Type-S errors (Gelman & Carlin 2014) as the literature-level signatures. The Manifesto for Reproducible Science (Munafò et al. 2017) lays out six principles; the individual-researcher agenda translates them into preregistration, power, sharing, effect-size reporting, no-dichotomy, and confirmatory/exploratory labelling. The funnel-plot widget shows how the filter inflates the literature; the checklist widget operationalises the fix. The HONEST CAVEATS (slow incentives, field-specific structural barriers, the limits of procedural reform) keep the agenda from over-promising. Part 2 closes here. Parts 3 (CIs), 7 (Bayesian), and 8 (resampling) carry the alternative inferential paradigms forward.
What you now know
You can quote the empirical replication-rate evidence: Open Science Collaboration (2015, Science), 36% of 100 psychology effects reproduced statistical significance in the same direction; Camerer et al. (2018, Nature Human Behaviour), 62% of 21 social-science experiments from Nature and Science; Many Labs 2 (Klein et al. 2018), 50% of 28 effects; Begley & Ellis (2012, Nature), 11% of 53 preclinical cancer findings. You can name the four mechanical contributors (publication bias, §2.4 p-hacking, §2.5 multiple-testing flexibility, §2.2 low power) and explain how they compound.
You can state TYPE-M error (Gelman & Carlin 2014, Perspectives on Psychological Science) — under-powered studies systematically OVER-state the magnitude of true effects, conditional on reaching significance — and TYPE-S error — the probability of reporting an effect with the WRONG SIGN. You can sketch the Ioannidis (2005, PLoS Medicine) positive-predictive-value formula and explain why low power decreases the trustworthiness of significant findings rather than merely increasing the false-negative rate.
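As a rough numerical companion to the paragraph above, the sketch below computes the Ioannidis positive predictive value and the Gelman & Carlin Type-S and Type-M quantities. It assumes a normally distributed estimate with known standard error and a two-sided test. The function names `ppv` and `retrodesign` and their parameter names are illustrative choices, not taken from either paper's code.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ppv(power, prior_odds, alpha=0.05, bias=0.0):
    """Positive predictive value of a 'significant' finding (Ioannidis 2005).

    prior_odds is R, the pre-study odds that a tested relationship is true.
    bias is u, the share of would-be non-significant analyses that end up
    reported as significant anyway.
    """
    beta = 1.0 - power
    true_pos = power * prior_odds + bias * beta * prior_odds
    all_pos = (prior_odds + alpha - beta * prior_odds
               + bias - bias * alpha + bias * beta * prior_odds)
    return true_pos / all_pos


def retrodesign(true_effect, se, alpha=0.05):
    """Return (power, Type-S rate, Type-M exaggeration ratio).

    Assumes estimate ~ Normal(true_effect, se^2) with true_effect > 0.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    hi = (z_crit * se - true_effect) / se    # standardised upper cutoff
    lo = (-z_crit * se - true_effect) / se   # standardised lower cutoff
    p_hi = 1.0 - STD_NORMAL.cdf(hi)          # significant, correct sign
    p_lo = STD_NORMAL.cdf(lo)                # significant, wrong sign
    power = p_hi + p_lo
    type_s = p_lo / power
    # E[|estimate| ; significant], from truncated-normal partial means.
    abs_sum = (true_effect * p_hi + se * STD_NORMAL.pdf(hi)
               - true_effect * p_lo + se * STD_NORMAL.pdf(lo))
    exaggeration = abs_sum / power / true_effect
    return power, type_s, exaggeration


if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0, 3.0):
        power, type_s, type_m = retrodesign(true_effect=d, se=1.0)
        print(f"effect/se={d:.1f} power={power:.3f} "
              f"type_S={type_s:.4f} type_M={type_m:.2f}")
    print(f"PPV at power 0.20, R = 0.1: {ppv(0.20, 0.1):.2f}")
```

Running the loop with a smaller ratio of true effect to standard error shows power falling while both the wrong-sign probability and the exaggeration ratio rise. This is the sense in which low power damages the significant findings themselves, not only the false-negative rate.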
You can recite the six Manifesto principles (Munafò et al. 2017, Nature Human Behaviour): protect against cognitive biases (preregistration, blinding); improve methodology (power, design); improve reporting (open data, open code); reproducibility training; diversify peer review; reward open practices. You can list the practical individual-researcher agenda: preregister using AsPredicted (Simonsohn et al. 2014) or OSF (Nosek et al. 2018); compute power before collecting data; share data and code on OSF or GitHub with environment files; report effect sizes with CIs (Wasserstein, Schirm, Lazar 2019, American Statistician); do not dichotomise at p = 0.05; distinguish confirmatory from exploratory; use equivalence testing (§2.7) when supporting the null is the research question.
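The "compute power before collecting data" item can be sketched with a normal-approximation sample-size calculation for a two-group comparison of means. This is a minimal sketch, not a replacement for a dedicated power tool: it assumes equal group sizes, a standardised effect size d, and a two-sided z-test, so an exact t-based calculation will ask for slightly more participants. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n to detect standardised effect d."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / d) ** 2)


def achieved_power(d, n, alpha=0.05):
    """Approximate two-sided power with n participants per group."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(d) * math.sqrt(n / 2.0)  # expected z-statistic
    return (1.0 - STD_NORMAL.cdf(z_alpha - shift)) + STD_NORMAL.cdf(-z_alpha - shift)


if __name__ == "__main__":
    for target in (0.80, 0.90):
        n = n_per_group(0.5, power=target)
        print(f"d=0.5 target power={target:.2f} n per group={n} "
              f"achieved={achieved_power(0.5, n):.3f}")
```

The planning value of d should come from a justified smallest effect of interest rather than from a published estimate, since published estimates in an under-powered literature are likely to be inflated.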
You can use the publication-bias-funnel to see how a tunable filter empties the lower-left corner of the funnel and inflates the meta-analytic mean, and the research-reform-checklist to score a planned study across preregistration, power, analysis plan, sharing, and reporting categories. You can name the load-bearing items — preregistration, power, sharing, effect-size reporting — and identify weak categories from the per-category breakdown.
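The filter behind the funnel can also be reproduced in a few lines of simulation. The sketch below is not the widget's actual implementation. It assumes a fixed true standardised effect, per-group sample sizes drawn uniformly between 10 and 200, a standard error of sqrt(2/n), and a filter that always publishes results significant in the expected direction and publishes the rest with probability `pub_prob_nonsig`. All names and default values are illustrative.

```python
import math
import random


def simulate_literature(true_d=0.2, n_studies=2000, pub_prob_nonsig=0.1, seed=1):
    """Return (all_studies, published) as lists of (estimate, se) pairs."""
    rng = random.Random(seed)
    all_studies, published = [], []
    for _ in range(n_studies):
        n = rng.randint(10, 200)           # per-group sample size (assumed range)
        se = math.sqrt(2.0 / n)            # approximate SE of a standardised difference
        estimate = rng.gauss(true_d, se)
        study = (estimate, se)
        all_studies.append(study)
        significant = estimate / se > 1.96  # significant in the expected direction
        if significant or rng.random() < pub_prob_nonsig:
            published.append(study)
    return all_studies, published


def pooled_mean(studies):
    """Inverse-variance-weighted (fixed-effect) mean of the estimates."""
    weights = [1.0 / se ** 2 for _, se in studies]
    total = sum(w * est for w, (est, _) in zip(weights, studies))
    return total / sum(weights)


if __name__ == "__main__":
    for q in (1.0, 0.5, 0.1, 0.0):
        everything, visible = simulate_literature(pub_prob_nonsig=q)
        print(f"pub_prob_nonsig={q:.1f} published={len(visible)} "
              f"pooled_all={pooled_mean(everything):.3f} "
              f"pooled_published={pooled_mean(visible):.3f}")
```

Plotting estimate against standard error for `published` at small `pub_prob_nonsig` should show the missing corner of small, imprecise, near-zero studies that a funnel-plot asymmetry test looks for.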
You can state the honest caveats: the incentive structure changes slowly; fields differ in reform pace; methodological reform alone cannot fix bad questions or bad measurement; replication failure has multiple causes; preregistration is necessary, not sufficient. You can sketch why fixing bad measurement (Cronbach & Meehl 1955, Psychological Bulletin) is a separate discipline that comes before procedural reform.
Where this lands in the rest of the book. Part 3 (confidence intervals) is the replacement for the p < 0.05 dichotomy: CIs encode effect size and uncertainty simultaneously, and the Wasserstein et al. (2019) recommendations effectively make Part 3 the default reporting unit. Part 7 (Bayesian methods) is the alternative paradigm Kruschke (2018) and Wagenmakers et al. (2018) propose for direct probabilistic statements about parameters, side-stepping the p-value misinterpretation problem. Part 8 (resampling) provides bootstrap CIs as more robust alternatives to parametric inference. Together these are the constructive complements to the §2.8 critique: §2.8 names what is broken; Parts 3, 7, and 8 build what works.
Part 2 ends here. Eight sections: §2.1 Neyman–Pearson, §2.2 power, §2.3 classical tests, §2.4 p-values, §2.5 multiple testing, §2.6 preregistration, §2.7 equivalence, §2.8 the replication crisis. Together they take a reader from the foundations of hypothesis testing to the procedural discipline needed to NOT contribute to the next replication crisis.
References
- Open Science Collaboration (2015). "Estimating the reproducibility of psychological science." Science 349(6251), aac4716. (The canonical replication paper. 100 effects, 270 contributing authors, 64 sites. 36% reproduced statistical significance with the same direction; mean replication effect size was half the original. The paper that made the replication crisis central to psychology.)
- Camerer, C.F., Dreber, A., Holzmeister, F., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek, B.A., Pfeiffer, T., et al. (2018). "Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015." Nature Human Behaviour 2(9), 637–644. (The economics-and-social-science analogue: 21 experiments from the highest-prestige journals, replicated at roughly 5× the original n. 62% replication rate (13 of 21), with replicated effects about 50% of original sizes.)
- Klein, R.A., Vianello, M., Hasselman, F., Adams, B.G., Adams, R.B. Jr., Alper, S., et al. (2018). "Many Labs 2: Investigating variation in replicability across samples and settings." Advances in Methods and Practices in Psychological Science 1(4), 443–490. (28 effects, 125 samples in 36 countries, n > 15,000. Designed specifically to address the "hidden moderator" hypothesis. 14 of 28 effects (50%) remained significant in the original direction under the paper's strict p < .0001 criterion; 15 of 28 (54%) did so at p < .05.)
- Klein, R.A., Ratliff, K.A., Vianello, M., Adams, R.B. Jr., Bahník, Š., Bernstein, M.J., et al. (2014). "Investigating variation in replicability: A ‘many labs’ replication project." Social Psychology 45(3), 142–152. (The warm-up for Many Labs 2. 13 classic and contemporary psychology effects, 36 samples. Ten of 13 replicated consistently; imagined contact showed only weak support, and currency priming and flag priming did not replicate.)
- Ioannidis, J.P.A. (2005). "Why most published research findings are false." PLoS Medicine 2(8), e124. (The formal account: under realistic priors on hypothesis truth, realistic power, and realistic bias, the positive predictive value of a significant finding can be well below 50%. The most-cited paper ever published in PLoS Medicine.)
- Gelman, A., Carlin, J. (2014). "Beyond power calculations: assessing Type S (sign) and Type M (magnitude) errors." Perspectives on Psychological Science 9(6), 641–651. (The TYPE-M and TYPE-S formalism. Under-powered studies that reach significance over-state the true effect magnitude (Type-M) and occasionally invert its sign (Type-S). The literature-level inflation made quantitatively precise.)
- Munafò, M.R., Nosek, B.A., Bishop, D.V.M., Button, K.S., Chambers, C.D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J.J., Ioannidis, J.P.A. (2017). "A manifesto for reproducible science." Nature Human Behaviour 1(1), 0021. (The 10-author consensus statement. Six principles: protect against cognitive biases, improve methodology, improve reporting, reproducibility training, diversify peer review, reward open practices.)
- Wasserstein, R.L., Schirm, A.L., Lazar, N.A. (2019). "Moving to a world beyond 'p < 0.05'." American Statistician 73(sup1), 1–19. (The American Statistical Association's editorial introducing the special issue. Recommends moving away from the p = 0.05 dichotomy, toward effect sizes with CIs, exact p-values, and interpretive language.)
- Nosek, B.A., Ebersole, C.R., DeHaven, A.C., Mellor, D.T. (2018). "The preregistration revolution." PNAS 115(11), 2600–2606. (The OSF preregistration framework and template; the institutional case for preregistration as a default; data on uptake rates by field.)
- Begley, C.G., Ellis, L.M. (2012). "Drug development: Raise standards for preclinical cancer research." Nature 483(7391), 531–533. (The Amgen replication audit: 47 of 53 landmark preclinical cancer findings did NOT replicate (11% replication rate). The biomedicine version of the crisis.)
- Egger, M., Davey Smith, G., Schneider, M., Minder, C. (1997). "Bias in meta-analysis detected by a simple, graphical test." BMJ 315(7109), 629–634. (The funnel-plot asymmetry test for publication bias. The diagnostic the §2.8 widget makes interactive.)
- Wasserstein, R.L., Lazar, N.A. (2016). "The ASA's statement on p-values: context, process, and purpose." American Statistician 70(2), 129–133. (The 2016 ASA statement: six principles on what p-values can and cannot do. The precursor to the 2019 special issue.)
- Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). "False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant." Psychological Science 22(11), 1359–1366. (The simulation-based demonstration that researcher degrees of freedom can inflate the per-study Type-I rate to about 60% when several common flexibilities are combined. The motivating paper for the reform movement.)