What a p-value is — and what it is not

Part 2 — Hypothesis testing without p-hacking

Learning objectives

State THE definition: the p-value is $P(T(X) \text{ as or more extreme than } T_{\text{obs}} \mid H_0)$ , the tail probability of the test statistic under $H_0$ ; "more extreme" follows the test's directionality (one-sided / two-sided)
Distinguish what the p-value IS (a tail probability conditional on H₀ and on the chosen statistical model) from what it IS NOT — common misinterpretations: P(H₀ | data), 1 − P(H₁ | data), probability of replication, strength of evidence on its own, effect size, scientific significance
Read p < 0.05 honestly: "if H₀ were exactly true, results this far from H₀ would occur less than 5% of the time in repeated identical experiments" — and recognise it does NOT mean "H₀ is false with > 95% probability"
Recite the SIX principles from the ASA 2016 statement on p-values (Wasserstein & Lazar): p-values can indicate incompatibility with a model, they do NOT measure P(hypothesis | data), scientific conclusions should NOT rest on p < 0.05 alone, proper inference requires full reporting, p-values do NOT measure effect size, p-values alone do NOT provide good evidence about a hypothesis
Diagnose four p-hacking mechanisms: OPTIONAL STOPPING (peek-and-stop), GARDEN OF FORKING PATHS (implicit analysis choices), MULTIPLE-COMPARISONS FISHING (test many outcomes, report best), POST-HOC OUTLIER EXCLUSION; quantify the Type-I inflation each one produces from Monte Carlo simulation
Argue why PREREGISTRATION (locking in the analysis plan, including stopping rules, outcome variables, and exclusion criteria, BEFORE seeing the data) is the procedural fix for the four mechanisms above — preview §2.6
Identify complementary / replacement reporting: CONFIDENCE INTERVALS (Part 3 §3.1) carry strictly more information than a single p; BAYES FACTORS $B_{10} = P(D \mid H_1)/P(D \mid H_0)$ give a direct evidence ratio in either direction; EFFECT SIZES with CIs are the recommended substitute (Cumming 2014)
Recognise that the same p-value at LARGE n can reflect a TRIVIAL effect (statistical significance ≠ practical significance) and at SMALL n can miss a LARGE effect (low power)
Trace the historical Fisher / Neyman-Pearson framework split (Hubbard & Bayarri 2003) and recognise the modern hybrid — "reject if p < α" — as the procedural muddle that §2.4 dissects
Preview §2.5 (multiple-testing corrections — FWER, FDR), §2.6 (preregistration mechanics), §2.7 (equivalence testing for "no meaningful effect")

§2.1 set up the Neyman–Pearson decision framework with α and β as its operating characteristics. §2.2 turned those into design-stage numbers — given (α, β, effect size, n), fix any three and the fourth pops out. §2.3 walked the t, χ², and F machinery that turns data into a test statistic and the test statistic into a p-value. §2.4 is the section that pauses, looks at the number this entire machinery produces, and asks: what is it? And what is it not?

The honest answer is brief and easy to misremember. The p-value is THE TAIL PROBABILITY OF THE TEST STATISTIC UNDER H₀ — given the data, given the test, and given the assumption that H₀ is exactly true, the p-value is the probability of seeing a test statistic at least as far from H₀ as the one you saw. It is a CONDITIONAL probability, on H₀; it is NOT a probability statement ABOUT H₀. That distinction is the single most-misunderstood point in applied statistics, and Goodman (2008) catalogued twelve named misinterpretations that recur in textbooks, papers, and grant reviews. The replication crisis (§2.8) is in large part a consequence of those misinterpretations compounding across a literature.

The §2.4 arc has six stops. First, the DEFINITION — precise, with examples in the t, z, and χ² cases. Second, the SIX MISINTERPRETATIONS — what the p-value is NOT, with the counter-example for each. Third, the AMERICAN STATISTICAL ASSOCIATION 2016 STATEMENT — the only official-position document the ASA has ever issued on a statistical practice, and the cleanest summary of the consensus position. Fourth, P-HACKING — the four mechanisms (optional stopping, forking paths, multiple-comparisons fishing, outlier exclusion) and what each does to the Type-I rate. Fifth, PREREGISTRATION as the procedural fix, with a preview of §2.6. Sixth, ALTERNATIVES — confidence intervals (Part 3), Bayes factors, and effect-size-with-CI reporting (Cumming 2014). Two widgets thread the section: a p-value distribution simulator and a p-hacking simulator.

One framing note before the definition. The p-value is a TOOL. Like any tool, it has a designed-for use case and a set of failure modes. Used inside its design envelope (single, pre-specified test on pre-collected data, interpreted as evidence against H₀, never as evidence FOR a hypothesis), it is a useful piece of evidence. Used outside its envelope (post-hoc fishing, dichotomous yes/no decisions about a continuous evidence scale, conflated with posterior probability), it produces the exact pattern of false positives and unreplicable findings that has destabilised whole literatures. §2.4 is about staying inside the envelope.

The definition, said precisely

Let $T(X)$ be a test statistic computed from data $X$ . Let $H_0$ be a fully-specified null hypothesis — a statistical model from which the distribution of $T(X)$ can be derived. Let $T_{\text{obs}}$ be the numerical value of the test statistic on the data actually observed. The p-value is

$$p \;=\; P\!\left( T(X) \text{ as or more extreme than } T_{\text{obs}} \;\big|\; H_0 \right).$$

Two things to read carefully. (i) The probability is CONDITIONAL on $H_0$ — it is computed assuming $H_0$ is exactly the truth. (ii) The phrase "as or more extreme" is determined by the test's directionality: for a one-sided test of $H_0: \mu = 0$ vs $H_1: \mu > 0$ , "extreme" means "large positive"; for a two-sided test, "extreme" means "large in absolute value"; for a chi-squared goodness-of-fit, "extreme" means "large $\chi^2$ ". The choice of directionality is a DESIGN choice, made before seeing the data — switching after the fact is one of the p-hacking mechanisms below.

Worked instances of the formula, drawing on §2.3:

One-sample t-test. $T = (\bar X - \mu_0)/(s/\sqrt n)$ ; under $H_0$ , $T \sim t_{n-1}$ . Two-sided p-value: $p = 2 P(t_{n-1} > |T_{\text{obs}}|)$ . For the §2.3 oven-thermometer example (T_obs = −1.93, df = 7), $p \approx 2 \cdot P(t_7 > 1.93) \approx 0.095$ .
Two-sample z-test. $T = (\bar X_1 - \bar X_2)/\text{SE}$; under $H_0$, $T \sim \mathcal{N}(0, 1)$. One-sided p-value: $p = 1 - \Phi(T_{\text{obs}})$. At $T_{\text{obs}} = 1.645$, $p \approx 0.05$: 1.645 is the one-sided critical value that matches the conventional α = 0.05.
χ² of independence. $\chi^2 = \sum (O - E)^2 / E$ ; under $H_0$ , $\chi^2 \sim \chi^2_{(r-1)(c-1)}$ . p-value: $p = P(\chi^2_{\nu} > \chi^2_{\text{obs}})$ . For the §2.3 vaccine-trial example (χ² = 4.51, df = 1), $p \approx P(\chi^2_1 > 4.51) \approx 0.034$ .
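The three worked p-values can be checked numerically. The sketch below is a minimal stdlib-only check, not the §2.3 code: the helper names (`t_two_sided_p`, `z_one_sided_p`, `chi2_df1_p`) are hypothetical, the t tail is obtained by Simpson integration of the t density, and the χ² helper relies on the df = 1 identity $P(\chi^2_1 > x) = P(|Z| > \sqrt{x})$, so it does not generalise to other degrees of freedom.

```python
import math
from statistics import NormalDist


def t_two_sided_p(t_obs: float, df: int, steps: int = 20_000) -> float:
    """Two-sided p-value 2 * P(t_df > |t_obs|), by Simpson integration of the t density."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x: float) -> float:
        return norm * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t_obs)
    h = upper / steps                      # steps is even, as Simpson's rule requires
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    mass_0_to_t = total * h / 3            # P(0 < T < |t_obs|)
    return 1 - 2 * mass_0_to_t             # probability in both tails beyond |t_obs|


def z_one_sided_p(z_obs: float) -> float:
    """Upper-tail p-value for a standard-normal test statistic."""
    return 1 - NormalDist().cdf(z_obs)


def chi2_df1_p(chi2_obs: float) -> float:
    """Upper-tail p-value for chi-squared with 1 degree of freedom only."""
    return math.erfc(math.sqrt(chi2_obs / 2))


p_t = t_two_sided_p(-1.93, df=7)      # oven-thermometer example
p_z = z_one_sided_p(1.645)            # one-sided z at the conventional cutoff
p_chi2 = chi2_df1_p(4.51)             # vaccine-trial example

print(f"t-test:  p = {p_t:.3f}")
print(f"z-test:  p = {p_z:.3f}")
print(f"chi2:    p = {p_chi2:.3f}")
```

The printed values should agree with the hand calculations above to the precision quoted there.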

Three structural properties follow from the definition. First, the p-value is a RANDOM VARIABLE: it is a function of the data, and the data are random, so its value varies from sample to sample. Second, UNDER $H_0$, the p-value of any continuous test statistic has an EXACTLY UNIFORM[0, 1] DISTRIBUTION. (Proof sketch: if $T$ has a continuous CDF $F_0$ under $H_0$, then $F_0(T) \sim \text{Uniform}(0, 1)$ by the probability-integral transform; the upper-tail p-value is $1 - F_0(T)$, which is also uniform. Casella-Berger 2002 §8.3.4.) Third, UNDER $H_1$, the p-value's distribution is RIGHT-SKEWED, with its mass piled toward 0: $P(p < 0.05)$ exceeds 1/20, and the skew increases with effect size and sample size. The first widget makes the second and third facts visible.

The first widget simulates many studies and histograms their p-values. The setup is a one-sided z-test for the normal mean, $H_0: \mu = 0$ vs $H_1: \mu > 0$, on iid Normal(μ, 1) data with σ = 1 treated as known. The reader picks μ (= 0 for H₀, > 0 for H₁) and n; the widget runs 2000 simulated studies, computes a p-value for each, and bins them into 20 bins of width 0.05.

Things to verify in the widget:

At μ = 0 (the H₀ slider position), the histogram is FLAT. Every bin sits near 100 (= 2000 / 20), within Monte Carlo noise. The leftmost bin (p < 0.05) holds about 5% of studies — the empirical Type-I rate. This is the textbook fact that the p-value is uniform under H₀, made literal by simulation. Re-roll to see the noise envelope.
Slide μ to 0.30 at n = 30. The histogram tilts: the leftmost bins inflate, the right bins deflate. The leftmost bin now sits at ~50% (theory: $1 - \Phi(1.645 - 0.3\sqrt{30}) \approx 0.50$); that is the POWER of the test at this effect, and you have read it directly off the histogram.
At μ = 0.50, n = 30, power is about 86%. The histogram is heavily right-skewed; roughly six in every seven studies report p < 0.05.
At μ = 0.10 (a tiny effect), n = 30, the histogram is almost flat: power is only about 14%. A non-significant result from this study tells you essentially nothing about the truth of H₁ at this effect size; the test does not have the resolution to see it.
At μ = 0.50, slide n from 30 up to 100. The histogram becomes near-degenerate at the leftmost bin: nearly every study rejects, and power approaches 1. This is the structural lesson §2.2 made formal: more n collapses the p-value distribution toward 0 under any true H₁.
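The checklist above can be reproduced outside the widget. The following is a minimal sketch under the setup stated earlier (one-sided z-test, iid Normal(μ, 1) data, σ = 1 known, 2000 studies, 20 bins); it is not the widget's actual source, and the function names and the fixed seed are illustrative assumptions.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simulate_p_values(mu: float, n: int, studies: int = 2000, seed: int = 0) -> list[float]:
    """One-sided z-test p-values from `studies` simulated samples of size n."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(studies):
        xbar = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        z = xbar * n ** 0.5                      # sigma = 1 is assumed known
        p_values.append(1.0 - STD_NORMAL.cdf(z))
    return p_values


def histogram(p_values: list[float], bins: int = 20) -> list[int]:
    """Counts per equal-width bin on [0, 1]; p = 1.0 goes into the last bin."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts


def analytic_power(mu: float, n: int, alpha: float = 0.05) -> float:
    """Power of the one-sided z-test: 1 - Phi(z_{1-alpha} - mu * sqrt(n))."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1.0 - STD_NORMAL.cdf(z_crit - mu * n ** 0.5)


null_hist = histogram(simulate_p_values(0.0, 30))
null_rate = null_hist[0] / sum(null_hist)        # empirical Type-I rate

for mu, n in [(0.0, 30), (0.1, 30), (0.3, 30), (0.5, 30), (0.5, 100)]:
    hist = histogram(simulate_p_values(mu, n))
    empirical = hist[0] / sum(hist)              # share of studies with p < 0.05
    print(f"mu={mu:.1f} n={n:3d}  empirical={empirical:.3f}  theory={analytic_power(mu, n):.3f}")
```

At μ = 0 the "theory" column is simply α, and the leftmost-bin share should sit near it; for μ > 0 the same bin estimates the power.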

What the p-value is NOT — six common misinterpretations

The p-value is a tail probability under H₀. Most published-paper-and-textbook misuse comes from sliding to one of the following SIX named misinterpretations. Each is wrong; for each, the counter-example illustrates how badly.

Misinterpretation 1: "The p-value is P(H₀ | data)." The most common error. The p-value is $P(T(X) \mid H_0)$; the misinterpretation flips the conditioning to $P(H_0 \mid T(X))$. These are different probabilities and require different machinery. Bayes' rule says

$$P(H_0 \mid \text{data}) = \frac{P(\text{data} \mid H_0) \cdot P(H_0)}{P(\text{data} \mid H_0) P(H_0) + P(\text{data} \mid H_1) P(H_1)}.$$

Computing $P(H_0 \mid \text{data})$ requires (i) the prior $P(H_0)$, the credibility you assigned to H₀ before seeing data, and (ii) the likelihood under H₁, not just H₀. Frequentist testing uses NEITHER. A p-value of 0.03 is fully compatible with $P(H_0 \mid \text{data})$ being 0.50, 0.10, 0.90, or anywhere in between, depending on the prior and on $P(\text{data} \mid H_1)$. Sellke, Bayarri & Berger (2001, Am Stat) showed that, over a broad class of alternatives, the Bayes factor in favour of H₀ satisfies $B_{01} \ge -e \cdot p \cdot \log p$ for $p < 1/e$ (natural log). With equal prior probabilities $P(H_0) = P(H_1) = 1/2$, this gives $P(H_0 \mid \text{data}) \ge \left(1 + [-e \cdot p \cdot \log p]^{-1}\right)^{-1}$, a much weaker statement than "p < 0.05 means H₀ is < 5% likely". At p = 0.05 the bound on the Bayes factor is about 0.41, and the corresponding posterior is at least about 0.29, not 0.05.
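The arithmetic behind the Sellke-Bayarri-Berger calibration is short enough to script. The sketch below assumes the bound in the form $B_{01} \ge -e\,p\,\ln p$ (valid for $p < 1/e$) and converts it to a lower bound on the posterior via prior odds; the function names and the default 50/50 prior are illustrative assumptions, not part of the cited paper.

```python
import math


def min_bayes_factor_h0(p: float) -> float:
    """Lower bound on B_01 (evidence for H0 over H1): -e * p * ln(p), for 0 < p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the calibration applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def min_posterior_h0(p: float, prior_h0: float = 0.5) -> float:
    """Lower bound on P(H0 | data) implied by the Bayes-factor bound and a prior on H0."""
    prior_odds = prior_h0 / (1.0 - prior_h0)
    posterior_odds = min_bayes_factor_h0(p) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


for p in (0.05, 0.01, 0.0056, 0.001):
    b01 = min_bayes_factor_h0(p)
    print(f"p={p:<7} B01 >= {b01:.3f}  B10 <= {1 / b01:6.1f}  P(H0|data) >= {min_posterior_h0(p):.3f}")
```

Changing `prior_h0` shows how strongly the posterior bound depends on the prior, which is exactly the ingredient the p-value does not contain.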

Misinterpretation 2: "1 − p is the probability that H₁ is true." Same error, complemented. If p does not equal P(H₀ | data), then 1 − p does not equal P(H₁ | data). The misinterpretation often hides as "p = 0.03 means the result is 97% certain to replicate" — a claim with no formal frequentist or Bayesian basis (Goodman 1992; Cumming 2008 Perspect Psychol Sci).

Misinterpretation 3: "The p-value is the probability the result was due to chance." Slippery wording. If you fix H₀ as "chance alone" (no real effect), then p is the chance of observing data at least this extreme IF chance were operating. The misinterpretation reverses this to "p is the probability that chance, not the alternative, produced the data" — which is again P(H₀ | data) and requires a prior. The wording is in nearly every introductory textbook; it is wrong.

Misinterpretation 4: "p < 0.05 means the result will replicate." No. A p-value tells you about one experiment's data under H₀; it tells you nothing directly about a future experiment. Cumming (2008) worked it out: at p = 0.05 on the first study, the 80% prediction interval for the next study's p is approximately (0.00008, 0.44), so a single p of 0.05 is consistent with future p's anywhere from 0.00008 to 0.44. The "dance of the p-values" is the empirical fact that single-study p-values are noisy; replication probability is a separate quantity that depends on the underlying effect size, the new study's n, and the new study's α.

Misinterpretation 5: "Small p means a big effect." No. The test statistic scales as (standardised effect size) × √n, so the p-value mixes the two: a small effect with a huge n gives a tiny p that is statistically significant but practically trivial; a large effect with a small n can give a large p (low power) that is non-significant but practically important. The classic illustration: a clinical trial comparing two drug doses at n = 10000 per arm finds p = 0.001 for a mean blood-pressure reduction difference of 0.3 mmHg, which is statistically incontrovertible and clinically irrelevant. The §2.2 effect-size-translator widget made this concrete; §2.4 reiterates: the p-value is a HYPOTHESIS-COMPATIBILITY summary, not an effect-size summary. Always report effect size and confidence interval ALONGSIDE the p (Wilkinson & APA 1999; ICMJE 2018).
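A small calculation makes the effect-size / sample-size trade-off concrete. The sketch assumes a two-sample z-test with equal arms and a known common SD, so that $z = d\sqrt{n/2}$ for standardised effect $d$ and $n$ subjects per arm; the function name and the two illustrative designs are assumptions chosen for this example.

```python
from statistics import NormalDist


def two_sample_z_p(d: float, n_per_arm: int) -> float:
    """Two-sided p-value when the observed standardised difference equals d."""
    z = d * (n_per_arm / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Trivial effect, huge trial: d = 0.05 with 10,000 subjects per arm.
p_trivial_effect = two_sample_z_p(d=0.05, n_per_arm=10_000)

# Large effect, tiny pilot: d = 0.8 (Cohen's "large") with 5 subjects per arm.
p_large_effect = two_sample_z_p(d=0.8, n_per_arm=5)

print(f"d = 0.05, n = 10000 per arm: p = {p_trivial_effect:.5f}")
print(f"d = 0.80, n = 5 per arm:     p = {p_large_effect:.3f}")
```

The ordering of the two p-values is the reverse of the ordering of the two effects, which is why the p-value cannot stand in for an effect-size estimate.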

Misinterpretation 6: "A non-significant result means H₀ is true." No. As §2.1 said in the framing and §2.2 reiterated for underpowered designs: failing to reject H₀ is consistent with EITHER H₀ being true OR H₁ being true but not detectable at this n. Altman & Bland (1995, BMJ) put it crisply: "absence of evidence is not evidence of absence". If you genuinely want to support the claim of NO EFFECT, you need equivalence testing — TOST (two one-sided tests, Schuirmann 1987) — which §2.7 will cover. A p of 0.30 from an underpowered study tells you you have not seen the effect; it does not tell you the effect is not there.

The ASA 2016 statement on p-values — six principles

In 2016 the American Statistical Association issued an official statement on p-values — the only formal-position document on a statistical practice the ASA has ever published (Wasserstein & Lazar 2016, Am Stat). The statement crystallised twenty years of methodological-statistics consensus into six numbered principles. They are short, blunt, and worth memorising verbatim.

P-values can indicate how incompatible the data are with a specified statistical model. This is the positive content: a small p means the data would be surprising under the model that includes H₀. That is useful information.
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. A direct rejection of misinterpretations 1, 2, 3 above.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. A rejection of the dichotomous p < 0.05 cutoff as a decision rule.
Proper inference requires full reporting and transparency. Specifically, all the comparisons that were run, all the analytic choices that were made, all the data that were excluded. Selective reporting (publishing only significant findings, only one of many tried analyses) breaks the inference.
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. A direct rejection of misinterpretation 5.
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. The p-value is one piece of evidence; standing alone it is insufficient. Reporting must include effect size, confidence interval, and ideally pre-specification.

The 2019 follow-up editorial (Wasserstein, Schirm & Lazar 2019, Am Stat) went further: "We conclude, based on our review of the articles in this special issue and the broader literature, that it is time to stop using the term 'statistically significant' entirely." Some journals (e.g., Basic & Applied Social Psychology) have banned p-values outright; most have adopted softer guidance (effect size + CI required alongside p, no dichotomous "significant / non-significant" language in abstracts). The direction of travel is clear: REPORT THE P, but never alone, and never with the binary cutoff doing the conclusion-work.

p-hacking: four mechanisms and why they each inflate Type-I

The ASA principle #4, "proper inference requires full reporting", exists because there are FOUR well-documented mechanisms by which a single nominally-α-controlled test ceases to be α-controlled when the analysis is not pre-specified. Simmons, Nelson & Simonsohn (2011, Psychol Sci), in the paper titled "False-Positive Psychology", called such choices "researcher degrees of freedom" and ran simulations that put numbers on several of them. The four mechanisms:

Mechanism 1: optional stopping (data peeking). The researcher collects data in batches, runs the test after each batch, and stops as soon as p < 0.05. If H₀ is exactly true, the empirical Type-I rate of this procedure is NOT 5%. Armitage, McPherson & Rowe (1969, JRSS A) computed the rate as a function of the number of looks: it rises with every additional look, and with unlimited peeking the probability of EVER hitting p < 0.05 under H₀ tends to 1. With realistic peek schedules (every 10 observations, capped at n = 100), it is around 20-25%. The fix: pre-specify the n, or pre-specify the stopping rule via sequential-analysis machinery (Wald 1947; O'Brien-Fleming 1979) that explicitly controls the Type-I rate ACROSS the peeks.
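The inflation from peeking is easy to reproduce. The sketch below is an illustrative simulation under stated assumptions, not the widget's code: data are iid Normal(0, 1) so H₀ is true, the test at each look is a two-sided z-test with σ = 1 known, and the function names, trial count, and seed are hypothetical choices.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)   # two-sided 5% critical value


def rejects_with_peeking(rng: random.Random, step: int, cap: int) -> bool:
    """Collect `step` observations at a time; stop and reject at the first look with |z| > Z_CRIT."""
    total, n = 0.0, 0
    while n < cap:
        total += sum(rng.gauss(0.0, 1.0) for _ in range(step))
        n += step
        if abs(total) / n ** 0.5 > Z_CRIT:   # z-statistic of the running sample
            return True
    return False


def type_one_rate(step: int, cap: int, trials: int = 4000, seed: int = 1) -> float:
    """Share of simulated null studies that end in a rejection."""
    rng = random.Random(seed)
    return sum(rejects_with_peeking(rng, step, cap) for _ in range(trials)) / trials


fixed_n_rate = type_one_rate(step=100, cap=100)   # a single look at n = 100
peeking_rate = type_one_rate(step=10, cap=100)    # ten looks: n = 10, 20, ..., 100

print(f"single look at n=100: {fixed_n_rate:.3f}")
print(f"peek every 10 obs:    {peeking_rate:.3f}")
```

With a single look the rate stays near the nominal 5%; with ten looks it lands several times higher, in line with the schedule described above.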

Mechanism 2: garden of forking paths. Gelman & Loken (2014, Am Sci) coined the metaphor. Confronted with a dataset, a thoughtful researcher makes many small analytic choices: which variables to include, which transformation to apply, whether to log or square-root the outcome, whether to use a t-test or a Wilcoxon, whether to cluster the standard errors, whether to add a covariate. Each choice is a BRANCH. If only one branch is reported (the one with p < 0.05) but many were considered, the EFFECTIVE Type-I rate is inflated by the implicit multiple testing. The pernicious feature is that the researcher need not be CONSCIOUSLY trying anything — the inflation comes from the implicit selection.

Mechanism 3: multiple-outcome fishing. The researcher pre-specifies n, but measures K outcome variables on each subject. The headline result reports whichever outcome produced the smallest p. Under independence, if each outcome's p is uniform under H₀, the smallest of K is no longer uniform; the probability of at least one being < 0.05 is $1 - (1 - 0.05)^K$ . For K = 5, that's 22.6%; for K = 20, it's 64%. The standard fix: pre-specify ONE primary outcome (Bonferroni-correct the rest, or use FDR control — §2.5).
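The family-wise formula and its Bonferroni repair can be checked in a few lines. This is a sketch assuming K independent outcomes whose p-values are Uniform(0, 1) under H₀ (the independence assumption stated above); the function names are illustrative.

```python
import random


def familywise_rate(k: int, per_test_alpha: float = 0.05) -> float:
    """P(at least one of k independent null tests has p < per_test_alpha)."""
    return 1 - (1 - per_test_alpha) ** k


def simulate_fishing(k: int, trials: int = 20_000, alpha: float = 0.05, seed: int = 2) -> float:
    """Draw k uniform p-values per study and report the share whose smallest p is below alpha."""
    rng = random.Random(seed)
    hits = sum(min(rng.random() for _ in range(k)) < alpha for _ in range(trials))
    return hits / trials


def bonferroni_alpha(k: int, family_alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise rate at or below family_alpha."""
    return family_alpha / k


for k in (1, 5, 10, 20):
    corrected = familywise_rate(k, bonferroni_alpha(k))
    print(f"K={k:2d}  theory={familywise_rate(k):.3f}  simulated={simulate_fishing(k):.3f}  "
          f"after Bonferroni={corrected:.3f}")
```

The last column stays at or just under 5% for every K, which is the correction §2.5 develops in detail.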

Mechanism 4: post-hoc outlier exclusion. Compute the t-test on the full sample. If p > 0.05, look at the residuals, identify "outliers" (say, > 2σ from the mean), drop them, re-test. Report whichever p is smaller. Even when the exclusion rule is "principled" (e.g., 2σ), the AS-USED procedure is data-dependent — and conditioning on "the p was smaller after the drop" inflates Type-I above nominal.

The second widget runs all four mechanisms (plus a "no-hack" baseline) under H₀ TRUE and reports the empirical Type-I rate of each. The chart side-by-sides the bars; the dashed green line is the nominal 5% target. The point is to make the inflation visible — these are not academic concerns, they are visible at the bar-chart resolution after a few hundred simulated trials.

Things to verify:

No-hacking baseline. Empirical rate is at 5% within Monte Carlo noise (3-σ noise envelope on 2000 trials is ±1.5 percentage points). Re-roll a few times — the bar dances around 5% but stays in band.
Optional stopping (start n = 10, step 10, cap n = 100). Empirical Type-I climbs to ~20-25%. Four to five times the nominal rate. This is the Armitage et al. (1969) result in pixels.
Multi-outcome fishing (5 outcomes). Empirical Type-I hits ~22%, matching the theoretical $1 - (1 - 0.05)^5 \approx 22.6\%$.
Garden of forking paths (3 analyses on the same data). Empirical Type-I lands around 10-13% — lower than 5-outcome fishing because the three analyses are correlated (they all use the same data), but still 2-3× the nominal rate.
Outlier dropping (> 2σ). Empirical Type-I climbs to ~6-9%. The inflation is smaller than the other three (only a fraction of trials have any outliers to drop), but consistent across re-rolls. Combined with the other mechanisms in a real study, it compounds.
Simmons-Nelson-Simonsohn (2011) showed that COMBINING four such mechanisms takes the false-positive rate from 5% to ~60%. The widget runs them one at a time, but the real-world worry is the conjunction.

Preregistration: the procedural fix

The four p-hacking mechanisms share a common cause: the analysis was decided AFTER seeing the data, or revised in light of the data. PREREGISTRATION is the procedural fix. The researcher writes down — and timestamps in a public registry, such as OSF Registries, AsPredicted, or ClinicalTrials.gov — the FULL analysis plan BEFORE data collection: the hypotheses, the sample size, the primary outcome, the test statistic, the exclusion criteria, the stopping rule. After data collection, any analysis that matches the pre-registered plan is CONFIRMATORY (and its p-value is interpretable in the usual N-P sense); anything else is EXPLORATORY (and must be labelled as such).

Preregistration closes optional stopping (n is fixed), the forking paths (the path is fixed), the multi-outcome fishing (the primary is fixed), and post-hoc outlier dropping (the exclusion criteria are fixed). It does not eliminate exploratory work, which remains a valid and useful mode of inquiry, but it requires explicit labelling. The Nosek et al. (2018, PNAS) review found that pre-registered psychology studies had ~50% replication rates, versus ~36% for non-pre-registered. The 14-percentage-point gap is consistent with p-hacking having inflated the literature's false-positive rate, although the two groups of studies differ in other ways as well, so read it as suggestive rather than as a clean measurement. §2.6 will work out the preregistration mechanics in detail.

Better alternatives and complements

The ASA 2019 follow-up did not advocate abandoning frequentist inference; it advocated abandoning the DICHOTOMOUS p < 0.05 verdict. Three replacements / complements are pulled from the methodological-statistics consensus:

(i) Confidence intervals (Part 3 §3.1). A 100(1 − α)% CI for the parameter $\theta$ is exactly the set of $\theta_0$ values that a size-α test would NOT reject as the null. (Recall §2.1's test-inversion identity.) The CI carries strictly more information than a single-null p-value: it tells you not just "is θ₀ rejected at α" but "which range of θ values is compatible with the data at α". When the CI is narrow and excludes zero, you have evidence of a precise effect; when the CI is wide and includes zero, you have an UNDERPOWERED study (Cohen 1988; Cumming 2014).
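The test-inversion identity can be verified directly. The sketch assumes the simplest setting, a two-sided z-test for a normal mean with σ = 1 known; the observed mean 0.40, n = 30, and the grid of candidate nulls are made-up illustrative numbers, and the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_confidence_interval(xbar: float, n: int, sigma: float = 1.0, alpha: float = 0.05):
    """100(1 - alpha)% interval for the mean when sigma is known."""
    half_width = STD_NORMAL.inv_cdf(1 - alpha / 2) * sigma / n ** 0.5
    return xbar - half_width, xbar + half_width


def two_sided_p(xbar: float, mu0: float, n: int, sigma: float = 1.0) -> float:
    """Two-sided z-test p-value for H0: mu = mu0."""
    z = (xbar - mu0) * n ** 0.5 / sigma
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


xbar, n, alpha = 0.40, 30, 0.05
low, high = z_confidence_interval(xbar, n, alpha=alpha)

# For every candidate null on the grid: "not rejected at alpha" should equal "inside the CI".
grid = [i * 0.05 - 0.5 for i in range(41)]            # -0.5, -0.45, ..., 1.5
agreement = [(two_sided_p(xbar, mu0, n) >= alpha) == (low <= mu0 <= high) for mu0 in grid]

print(f"95% CI: ({low:.3f}, {high:.3f})")
print("test and CI agree on every grid point:", all(agreement))
```

Reporting the interval therefore answers the test for every candidate null at once, which is the sense in which it carries more information than one p-value.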

(ii) Bayes factors. The frequentist p-value is a tail probability under H₀ alone. The Bayes factor

$$B_{10} = \frac{P(D \mid H_1)}{P(D \mid H_0)}$$

is a direct ratio of the data's likelihood under H₁ to its likelihood under H₀ — symmetric in the two hypotheses, computed without a prior on the hypotheses, and naturally giving evidence in BOTH directions. $B_{10} = 10$ means the data are 10× more likely under H₁ than under H₀ (a "strong" evidence threshold by Kass & Raftery 1995). $B_{10} = 1/10$ means the reverse — strong evidence FOR H₀. The frequentist p-value cannot say the second thing; the Bayes factor can. The cost: you need to specify the distribution under H₁ (often by choosing a "default" prior on the alternative effect, e.g., the JZS prior of Rouder et al. 2009 Psychon Bull Rev), which adds modelling overhead.
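A toy Bayes factor shows evidence pointing in both directions. The sketch assumes a normal mean with σ = 1 known, $H_0: \mu = 0$, and under $H_1$ a Normal(0, τ²) prior on μ with τ = 0.5. That prior is an illustrative assumption chosen because it gives a closed form; it is not the JZS default mentioned above. Under these assumptions $\bar X \sim \mathcal{N}(0, 1/n)$ under H₀ and $\bar X \sim \mathcal{N}(0, \tau^2 + 1/n)$ under H₁.

```python
import math


def normal_pdf(x: float, variance: float) -> float:
    """Density of a zero-mean normal with the given variance."""
    return math.exp(-x * x / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def bayes_factor_10(xbar: float, n: int, tau: float = 0.5, sigma: float = 1.0) -> float:
    """B_10 = marginal density of xbar under H1 divided by its density under H0."""
    var_h0 = sigma ** 2 / n
    var_h1 = tau ** 2 + var_h0          # prior variance on mu plus sampling variance
    return normal_pdf(xbar, var_h1) / normal_pdf(xbar, var_h0)


n = 100
bf_null_like = bayes_factor_10(0.0, n)   # sample mean exactly at the null value
bf_borderline = bayes_factor_10(0.2, n)  # z = 2.0, two-sided p just under 0.05
bf_clear = bayes_factor_10(0.3, n)       # z = 3.0

for label, bf in [("xbar=0.0", bf_null_like), ("xbar=0.2", bf_borderline), ("xbar=0.3", bf_clear)]:
    print(f"{label}: B10 = {bf:.2f}  (B01 = {1 / bf:.2f})")
```

A sample mean at zero yields a Bayes factor below 1, which is support for H₀ that a p-value cannot express, while the borderline z = 2 case yields only weak support for H₁ under this prior. Different choices of τ change the numbers, which is the modelling overhead noted above.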

(iii) Effect sizes with CIs as the primary reporting unit. Cumming (2014, Psychol Sci) called it "the new statistics": report the effect size (Cohen's d, OR, NNT, whatever is natural to the test) with its CI, and let the reader interpret. If the CI excludes zero, the effect is non-zero at the test's level; if the CI is narrow, the estimate is precise; if the CI is wide, the data are underpowered. The information content is strictly greater than a single p, and the reader is not invited to make a binary accept/reject decision.

None of these alternatives makes the p-value obsolete; each makes the p-value insufficient when reported alone. The honest §2.4 recommendation: REPORT the p-value with full context — the test, the n, the directionality, the pre-registration status — and the effect size with its confidence interval. Where the audience can interpret a Bayes factor, report that too. Never report a p-value as a binary verdict.

The hybrid framework is the underlying confusion

One last historical / structural note before the recap. §2.1 flagged the muddle: modern textbook testing is a HYBRID of Fisher's significance testing (compute a p, interpret as evidence strength, no pre-set α) and Neyman-Pearson hypothesis testing (pre-set α, compute a decision, control long-run error rates). The hybrid is internally inconsistent. Fisher's p-value is a CONTINUOUS evidence summary; the Neyman-Pearson α is a BINARY-output operating characteristic. The standard "reject if p < 0.05" rule treats the continuous evidence as if it were a binary decision and dichotomises at an arbitrary cutoff — that is exactly the practice the ASA 2019 statement asks us to stop.

Hubbard & Bayarri (2003, Am Stat) traced this hybrid back to the 1940s textbook synthesis that papered over the Neyman-Fisher dispute by presenting both their machineries side by side without naming the philosophical conflict. The consequence: practitioners learn the N-P mechanics, interpret the Fisherian evidence summary, and conclude with a hybrid decision that neither Fisher nor Neyman would have endorsed. §2.4 untangles it by reading the p-value as the FISHERIAN evidence summary it actually is (a tail probability under H₀, a continuous measure of incompatibility with H₀) and recommending the COMPLEMENTARY machinery — pre-registration for procedural discipline, confidence intervals for precision, Bayes factors for symmetric evidence, effect sizes for substantive significance.

Try it

In the pvalue-distribution widget, set μ = 0 (the H₀ slider position), n = 30, studies = 2000. Run, re-roll three times. Each time, verify the histogram is flat within Monte Carlo noise. Read P(p < 0.05) from the status table — it should hover near 0.050. This is the Type-I rate, by design.
Same widget, μ = 0.3, n = 30. Power = P(p < 0.05) under H₁. Compute the theoretical power by hand: $\text{power} = 1 - \Phi(z_{0.95} - \mu\sqrt n) = 1 - \Phi(1.645 - 0.3 \cdot \sqrt{30}) = 1 - \Phi(1.645 - 1.643) = 1 - \Phi(0.002) \approx 0.50$. Two checks on the formula: μ in the widget is in σ units with σ = 1, so the effect size in the formula is μ itself; and the widget runs a z-test on √n·X̄ with σ known, so the reference distribution is Φ, not $t_{29}$ (a t-test, with its larger critical value of ~1.699, would have slightly lower power). The empirical P(p < 0.05) should land near 50%; the Monte Carlo standard error at 2000 studies is about 1.1 percentage points, so re-run with 5000 studies for a tighter check.
Same widget, μ = 0.1, n = 30. Power ≈ 14%. A non-significant result at this design tells you essentially nothing about whether H₁ is true. Now slide n up to 200 with μ = 0.1; power climbs to ~41% ($1 - \Phi(1.645 - 0.1\sqrt{200}) \approx 0.41$). Same effect size, different power, but the design is still underpowered: detecting a 0.1σ effect with 80% power takes $n \approx ((1.645 + 0.842)/0.1)^2 \approx 620$.
In the p-hacking-simulator, click "1. No hacking". Run, re-roll three times. The bar should orbit 5%. The Monte Carlo noise envelope is ±~1.5pp at 2000 trials.
Same widget, click "2. Optional stopping". Empirical Type-I should land around 20-25%. The MAGNITUDE of the inflation — 4-5× nominal — is the visceral point. A reviewer who reads "p < 0.05 at the planned final n" from an optionally-stopped study has been served a 20%-false-positive test, not a 5%-false-positive test.
Same widget, click "3. Multiple-outcome fishing". Type-I lands around 22%. Theoretical: $1 - (1 - 0.05)^5 = 0.226$ . Verify the empirical matches.
Same widget, click "4. Garden of forking paths". Type-I lands around 10-13%. Lower than 5-outcome fishing because the three forks here (raw t, sign, log-transform) are correlated — same data, different analyses — so the multiple-testing inflation is less than independent. The principle (inflation occurs) still holds.
Pen-and-paper. Translate the $1 - (1 - 0.05)^K$ formula into a Bonferroni-corrected per-test α. For K = 10 outcomes at family-wise α = 0.05, the per-test α is $0.05/10 = 0.005$. Equivalently: the smallest of 10 nominal-0.05 tests has an effective Type-I of $1 - (1 - 0.05)^{10} \approx 40\%$; correcting each to α = 0.005 brings the family-wise rate back to just under 5%. §2.5 will work this out as the FWER correction.
Pen-and-paper. A drug trial has n = 20,000 per arm and reports a mean systolic blood-pressure reduction of 0.5 mmHg, p ≈ 0.001. Compute Cohen's d if the within-arm SD is ~15 mmHg: $d = 0.5 / 15 \approx 0.033$, a tiny effect by Cohen 1988 (small = 0.2, medium = 0.5, large = 0.8). Check the p: the standard error of the difference is $15\sqrt{2/20000} = 0.15$ mmHg, so $z = 0.5/0.15 \approx 3.3$. The p is incontrovertible but the effect is clinically negligible. Discuss: what should the trial report ALONGSIDE the p?
Pen-and-paper. A pre-registered study reports the planned t-test with p = 0.04. A reviewer asks: "could you also run a non-parametric test as a robustness check?" The author runs Wilcoxon and gets p = 0.07. They report both. How is this different from p-hacking? (Hint: the additional test is openly LABELLED as a robustness check requested after the fact, not as the primary; the inference still hinges on the pre-registered analysis; reporting both is the ASA principle #4 transparency requirement.) Where would it have been p-hacking? (Hint: if the t-test had given p = 0.10 and they had switched to reporting a significant Wilcoxon as the primary.)
Pen-and-paper. Take a χ² test with χ² = 7.69, df = 1, p ≈ 0.0056 (a stronger result than the χ² = 4.51 quoted earlier for the §2.3 vaccine-trial example) and convert it into an upper bound on the Bayes factor $B_{10}$. The Sellke-Bayarri-Berger (2001) approximate bound is $B_{10} \le -1/(e \cdot p \cdot \log p) = -1/(2.718 \cdot 0.0056 \cdot \log 0.0056) = 1/(2.718 \cdot 0.0056 \cdot 5.19) \approx 12.7$. So the data are at most ~12.7× more likely under H₁ than under H₀: "strong" but not "decisive" evidence in Kass-Raftery terms. The p of 0.0056 sounds more dramatic than the Bayes factor; the Sellke bound is one way to read across.

Pause and reflect: §2.4 has dissected the single most-misunderstood number in applied statistics. The p-value is a tail probability under H₀, conditional on a specified statistical model. It is NOT a posterior probability of H₀; it is NOT the probability of replication; it is NOT the strength of evidence on its own; it is NOT the effect size. It IS a useful evidence summary when used inside its designed envelope: a single, pre-specified test on pre-collected data, interpreted as evidence against H₀, reported with effect size and confidence interval. Out of envelope — optional stopping, forking paths, multi-outcome fishing, post-hoc outlier dropping — it produces the false-positive rates that destabilised whole literatures. The ASA 2016/2019 statements asked us to stop the dichotomous p < 0.05 verdict and instead report effect size with CI, with the p as a complement. §2.5 (multiple-testing corrections), §2.6 (preregistration), and §2.7 (equivalence testing) build out the procedural framework that lets us interpret p-values honestly.

What you now know

The p-value is $P(T(X) \text{ as or more extreme than } T_{\text{obs}} \mid H_0)$ — a tail probability under H₀, conditional on the statistical model. Under H₀ (and a continuous test statistic), the p-value is exactly Uniform[0, 1]. Under H₁ it is right-skewed toward 0; the proportion in the leftmost (p < 0.05) bin is the power 1 − β.

You know the six MISINTERPRETATIONS: the p-value is NOT P(H₀ | data), NOT 1 − P(H₁ | data), NOT the probability of replication, NOT the strength of evidence on its own, NOT the size of the effect, NOT the practical significance. The ASA 2016 statement codified six principles that support these corrections; the 2019 follow-up asked us to stop using the term "statistically significant" entirely.

You can diagnose the four P-HACKING mechanisms (optional stopping, forking paths, multi-outcome fishing, post-hoc outlier exclusion) and use the second widget to quantify the Type-I inflation each produces. You know that PREREGISTRATION is the procedural fix, and that the Nosek et al. (2018) replication-rate gap (50% pre-reg vs 36% non-pre-reg) is one direct measurement of how much p-hacking inflated the published literature.
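For readers without access to the widget, the optional-stopping mechanism can be reproduced with a short Monte Carlo run. This is a stand-in sketch under simplifying assumptions, not the widget's implementation: data are Normal(0, 1) with known σ, the test is a two-sided z-test, and the hypothetical analyst peeks at equally spaced points up to n = 100 and stops at the first p < 0.05.

```python
import random
from statistics import NormalDist


def rejects_with_peeking(rng, looks, n_max=100, alpha=0.05):
    """Simulate one study under H0 (mean 0, known sigma 1) with interim looks.

    The analyst tests after every n_max // looks observations and stops at the
    first two-sided z-test with p < alpha. Returns True on a false positive.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    batch = n_max // looks
    total, n = 0.0, 0
    for _ in range(looks):
        for _ in range(batch):
            total += rng.gauss(0.0, 1.0)
            n += 1
        if abs(total) / n ** 0.5 > z_crit:  # |z| = |sum| / sqrt(n) when sigma = 1
            return True
    return False


def type_one_rate(looks, reps=4000, seed=0):
    """Share of simulated null studies that end in a rejection."""
    rng = random.Random(seed)
    return sum(rejects_with_peeking(rng, looks) for _ in range(reps)) / reps


fixed_n_rate = type_one_rate(looks=1)    # single test at n = 100
peeking_rate = type_one_rate(looks=10)   # test after every 10 observations
print(f"fixed n: {fixed_n_rate:.3f}   peeking: {peeking_rate:.3f}")
```

With a single look the rate should stay near the nominal 0.05. With ten looks it should come out several times higher, which is the pattern Armitage, McPherson and Rowe (1969) tabulated. The other three mechanisms can be simulated in the same style by changing what the analyst is allowed to choose after seeing the data.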

You know the three complementary / replacement reporting units: CONFIDENCE INTERVALS (the test-inversion identity of §2.1 means a CI is strictly more informative than a single-null p), BAYES FACTORS (a direct evidence ratio in either direction, with a default-prior cost), and EFFECT SIZES with CIs as the "new statistics" primary reporting unit (Cumming 2014).
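The test-inversion identity behind the confidence-interval claim can be checked numerically: a 95% interval contains exactly those null values that a two-sided test at α = 0.05 would not reject. The sketch below assumes a z-test with known standard error; the estimate of 1.2 and standard error of 0.5 are hypothetical numbers chosen for illustration, as are the function names.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_test_p(estimate, mu0, se):
    """Two-sided p-value for H0: mu = mu0 given an estimate and its standard error."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(estimate - mu0) / se))


def z_interval(estimate, se, level=0.95):
    """Symmetric confidence interval at the given level."""
    half_width = STD_NORMAL.inv_cdf(0.5 + level / 2.0) * se
    return estimate - half_width, estimate + half_width


estimate, se = 1.2, 0.5   # hypothetical study result
lo, hi = z_interval(estimate, se)

# Scan a grid of candidate null values: rejection at alpha = 0.05 should
# coincide exactly with the candidate lying outside the 95% interval.
candidates = [i / 10 for i in range(-10, 41)]
identity_holds = all(
    (z_test_p(estimate, mu0, se) < 0.05) == (not lo <= mu0 <= hi)
    for mu0 in candidates
)
print(f"95% CI: ({lo:.2f}, {hi:.2f})   identity holds on grid: {identity_holds}")
```

One interval therefore answers the significance question for every candidate null at once, whereas a single p-value answers it for one null only.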

Where this lands in Part 2. §2.5 handles multiple-testing corrections, FWER (Bonferroni, Holm-Bonferroni) and FDR (Benjamini-Hochberg): the analytic fix for the multi-outcome fishing problem when post-hoc multiplicity is unavoidable. §2.6 works out the mechanics of pre-registration, the operational complement of §2.4. §2.7 covers equivalence testing (TOST, Schuirmann 1987): the right way to support "no meaningful effect", since a non-significant frequentist test cannot. §2.8 covers the replication crisis: what happens to a literature when these structural failures compound across studies.

References

Wasserstein, R.L., Lazar, N.A. (2016). "The ASA's statement on p-values: context, process, and purpose." The American Statistician 70(2), 129–133. (The ASA 2016 statement. The six principles. Required reading for §2.4.)
Wasserstein, R.L., Schirm, A.L., Lazar, N.A. (2019). "Moving to a world beyond 'p < 0.05'." The American Statistician 73(sup1), 1–19. (The 2019 follow-up; recommends abandoning the dichotomous significance cutoff and the term "statistically significant".)
Goodman, S. (2008). "A dirty dozen: twelve p-value misconceptions." Seminars in Hematology 45(3), 135–140. (The canonical catalogue of misinterpretations. The "six" listed in §2.4 are the most common subset; Goodman has twelve.)
Hubbard, R., Bayarri, M.J. (2003). "Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing." The American Statistician 57(3), 171–178. (Definitive critique of the N-P/Fisher hybrid muddle.)
Gelman, A., Loken, E. (2014). "The statistical crisis in science." American Scientist 102(6), 460. (The "garden of forking paths" metaphor and the original argument that implicit analysis choices inflate Type-I.)
Simmons, J.P., Nelson, L.D., Simonsohn, U. (2011). "False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant." Psychological Science 22(11), 1359–1366. (The empirical demonstration that combining several p-hacking strategies yields ~60% false-positive rates without conscious cheating.)
Ioannidis, J.P.A. (2005). "Why most published research findings are false." PLoS Medicine 2(8), e124. (The seminal modelling paper that connected p-hacking, low power, and publication bias to the literature's reproducibility problems.)
Fisher, R.A. (1926). "The arrangement of field experiments." Journal of the Ministry of Agriculture of Great Britain 33, 503–513. (Origin of the 0.05 cutoff as a rule of thumb. Fisher emphasised it was a guideline, not a decision threshold.)
Sellke, T., Bayarri, M.J., Berger, J.O. (2001). "Calibration of p values for testing precise null hypotheses." The American Statistician 55(1), 62–71. (The $B_{10} \le -1/(e p \log p)$ bound; the Bayesian re-reading of a p-value.)
Kass, R.E., Raftery, A.E. (1995). "Bayes factors." Journal of the American Statistical Association 90(430), 773–795. (Bayes factors and the canonical evidence thresholds.)
Cumming, G. (2008). "Replication and p intervals." Perspectives on Psychological Science 3(4), 286–300. (The "dance of the p-values" simulation: after a two-tailed p = 0.05 in one study, the 80% prediction interval for a replication's one-tailed p is (0.00008, 0.44). The p-value is not the replication probability.)
Cumming, G. (2014). "The new statistics: why and how." Psychological Science 25(1), 7–29. (Effect-size-with-CI as the primary reporting unit; the modern recommended reporting practice.)
Armitage, P., McPherson, C.K., Rowe, B.C. (1969). "Repeated significance tests on accumulating data." Journal of the Royal Statistical Society A 132(2), 235–244. (Optional-stopping Type-I inflation rates; the foundational paper.)
Nosek, B.A., Ebersole, C.R., DeHaven, A.C., Mellor, D.T. (2018). "The preregistration revolution." PNAS 115(11), 2600–2606. (The pre-registration replication-rate gap; one of the canonical empirical demonstrations that pre-registration reduces false positives.)
Altman, D.G., Bland, J.M. (1995). "Absence of evidence is not evidence of absence." BMJ 311, 485. (The crisp formulation of why p > 0.05 does NOT mean H₀ is true.)
Wasserman, L. (2004). All of Statistics. Springer. (Chapter 10 is the cleanest one-chapter survey for the p-value's formal properties.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (§8.3 has the probability-integral-transform proof that p is Uniform[0, 1] under H₀.)