The Neyman–Pearson framework, honestly

Part 2 — Hypothesis testing without p-hacking

Learning objectives

Cast hypothesis testing as a DECISION problem: based on data, choose H₀ or reject H₀ — distinct from the question "is H₀ true?"
State the Neyman–Pearson framework: null hypothesis H₀, alternative H₁, test statistic T(X), rejection region R, decision rule "reject H₀ iff T(X) ∈ R"
Define Type-I error rate (size) $\alpha = P(T(X) \in R \mid H_0)$ — the false-positive rate, controlled by design
Define Type-II error rate $\beta(\theta) = P(T(X) \notin R \mid \theta), \theta \in H_1$ and POWER $1 - \beta(\theta)$ — the false-negative rate and its complement
State the Neyman–Pearson LEMMA for simple-vs-simple H₀ vs H₁: among all size-α tests, the LIKELIHOOD-RATIO test $\Lambda(x) = L(x \mid \theta_1) / L(x \mid \theta_0) > k_\alpha$ is most powerful (Neyman & Pearson 1933)
Extend to composite hypotheses: a UNIFORMLY MOST POWERFUL (UMP) test maximises power at every $\theta \in H_1$ ; UMP tests exist for one-sided alternatives in one-parameter exponential families (Karlin-Rubin); generalised likelihood-ratio tests (LRT) are the workhorse when UMP doesn't
Read the α–β trade-off: at fixed n, shrinking α inflates β; growing n shrinks both but the COUPLING never goes away — the only way to escape it is more data
Articulate the DECISION-THEORETIC (Neyman-Pearson) vs INFERENTIAL (Fisher) split and recognise that modern practice is a confused hybrid — §2.4 will untangle the p-value side of the muddle
Trace the "p < 0.05" convention to Fisher (1925) as a rough rule of thumb — not derived, not optimal, partly an arithmetic accident from pre-computer tables — and the psychological/sociological capture that followed (Hubbard & Bayarri 2003; Wasserstein et al. 2019)
Preview the connection to confidence intervals (Part 3 §3.1): a 100(1−α)% CI for θ is exactly the set of θ₀ that a size-α test does NOT reject, so CIs and N-P tests are two views of the same machinery

Part 1 built the estimator catalogue and its asymptotic theory. Every estimator is a random variable; we know its bias, variance, sampling distribution, large-sample limit. Part 2 opens the next question: given an estimate, how do you make a DECISION? Specifically, given data, do you act AS IF some scientifically-interesting hypothesis is true (e.g., "this drug works", "this geological layer is above the salt", "this regression coefficient is nonzero")?

Hypothesis testing is the formal answer. It is a 90-year-old framework, due primarily to Jerzy Neyman and Egon Pearson, with deep roots in Fisher's earlier significance-testing work. It is also, in modern practice, the single most-misused machinery in applied statistics — the p-value cargo cult, the replication crisis, the p-hacking arms race. §2.1 lays the framework honestly: what it IS, what it ISN'T, and the structural reasons later sections (§2.4 on p-values, §2.5 on multiple testing, §2.6 on preregistration, §2.8 on the replication crisis) need to exist at all.

The §2.1 arc has six stops. First, the FRAMING — testing as a DECISION procedure, not as a truth-detector. Second, the N-P FORMALISM — null, alternative, test statistic, rejection region, decision rule. Third, the TWO ERROR TYPES — α (false positive) and β (false negative), with power = 1 − β as the operationally useful summary. Fourth, the NEYMAN-PEARSON LEMMA — the theorem that says the likelihood-ratio test is most powerful for simple-vs-simple problems, and the UMP extension via Karlin-Rubin and generalised LRTs. Fifth, the DECISION-THEORETIC vs INFERENTIAL split — N-P's long-run error-rate framing vs Fisher's strength-of-evidence framing, and how modern practice has muddled them. Sixth, the 0.05 CONVENTION — its origin, its arbitrariness, and the structural reasons it persists. Two widgets thread the section.

Hypothesis testing is a DECISION, not a truth-finder

The first move is to be precise about what hypothesis testing does and does not do. It is NOT a procedure that takes data in and returns the answer to "is H₀ true?". The hypothesis $H_0$ is either true or it isn't — a single number-of-the-universe — and no statistical procedure can read it off finite noisy data.

What testing DOES is take data in and return a DECISION: act as if H₀ is true, or act as if H₁ is true. The decision is a function of the data; if you repeated the experiment, you would get a different draw of the data and possibly a different decision. The procedure's quality is measured by the LONG-RUN ERROR RATES of the decision rule across hypothetical replications — Type-I rate (reject H₀ when H₀ true) and Type-II rate (fail to reject H₀ when H₁ true). This is the Neyman–Pearson framing: a test is a DECISION RULE, evaluated by its OPERATING CHARACTERISTICS.

Why this distinction matters: every textbook misinterpretation of a p-value confuses "my test rejected at α = 0.05" with "there is a 95% chance H₁ is true". The first is a statement about a DECISION procedure under H₀; the second is a (Bayesian-flavoured) statement about a hypothesis. They are not the same and conflating them is the most common error in applied work (Hubbard & Bayarri 2003 §1; Greenland et al. 2016, Eur. J. Epidemiology). §2.4 will return to this in detail when it dissects what a p-value actually is. §2.1 just needs you to start the section with the DECISION framing fixed in mind.

The Neyman–Pearson formalism

A hypothesis-testing problem has four pieces:

A parameter space $\Theta$ — the set of possible values of the parameter $\theta$ governing the data-generating process. For testing a population mean, $\Theta = \mathbb{R}$ ; for a proportion, $\Theta = [0, 1]$ .
A null hypothesis $H_0: \theta \in \Theta_0$ and an alternative $H_1: \theta \in \Theta_1$ with $\Theta_0 \cap \Theta_1 = \emptyset$ . "Simple" means the hypothesis is a single point (e.g., $H_0: \theta = 0$ ); "composite" means a set of points (e.g., $H_1: \theta > 0$ ).
A test statistic $T(X)$ — a function of the data, designed to be SMALL under $H_0$ and LARGE under $H_1$ (or vice versa). For the normal-mean problem with iid $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ and known $\sigma$ , the natural statistic is $T = \sqrt n \bar X / \sigma$ .
A rejection region $R \subseteq \mathbb{R}$ (or $\mathbb{R}^k$ in the multivariate case). The DECISION RULE is: reject $H_0$ if and only if $T(X) \in R$ .

The rejection region is the entire specification of the test. Two tests with the SAME statistic but different $R$ are different tests; two tests with different statistics but the same rejection probabilities under H₀ and H₁ have the same operating characteristics. The N-P framework reduces test design to ONE QUESTION: given a statistic $T$ , what $R$ should you pick?

The answer is driven by the two error rates. First, the Type-I error rate (also called the size) is

\alpha = \sup_{\theta \in \Theta_0} P_\theta(T(X) \in R).

Reading: the probability of rejecting $H_0$ when $H_0$ is true. For a SIMPLE $H_0$ this is just $P_{\theta_0}(T(X) \in R)$ ; for a composite $H_0$ you take the worst case over $\Theta_0$ . The N-P design philosophy is: pick α first — at a level you can defend (0.05, 0.01, 5σ ≈ 6 × 10⁻⁷ in particle physics, etc.) — then choose $R$ to satisfy $\sup_{\theta \in \Theta_0} P_\theta(R) \le \alpha$ .

Second, the Type-II error rate is

\beta(\theta) = P_\theta(T(X) \notin R), \qquad \theta \in \Theta_1.

Reading: the probability of FAILING to reject when $H_1$ is the truth at parameter value $\theta$ . Unlike α, β is a FUNCTION of where in $\Theta_1$ the truth lies — it is small for $\theta$ far from the null (easy to detect) and large for $\theta$ near the null (hard to detect). The complement

\text{power}(\theta) = 1 - \beta(\theta) = P_\theta(T(X) \in R)

is the POWER FUNCTION of the test. Power is the operationally useful quantity: it says "if the truth is θ, what fraction of experiments will this test detect it?" A test with high power is one that catches real effects; a test with low power is one that often misses them, regardless of how strict its α is.

The first widget makes the α–β trade-off visible. The setup is the simple-vs-simple normal-mean problem: iid $X_1, \ldots, X_n \sim \mathcal{N}(\mu, 1)$ with $\sigma = 1$ known, testing $H_0: \mu = 0$ vs $H_1: \mu = \mu_1$ for a user-chosen $\mu_1 > 0$ . The test rejects when $\bar X > c_\alpha = z_{1-\alpha}/\sqrt n$ . Two densities — the sampling distribution of $\bar X$ under H₀ (blue) and under H₁ (orange) — are laid out on the same axis, separated by the vertical red $c_\alpha$ line. Four regions are shaded by their meaning:

Red (under blue, right of c_α): Type-I region. Area = α = P(reject H₀ | H₀ true). False positives.
Orange-hatched (under orange, left of c_α): Type-II region. Area = β = P(fail to reject | H₁ true). False negatives.
Green (under orange, right of c_α): Power region. Area = 1 − β = P(reject | H₁ true). Operationally useful.
White (under blue, left of c_α): Correct acceptance. Area = 1 − α.

Things to verify in the widget:

At the default (α = 0.05, μ₁ = 0.5σ, n = 30): $c_\alpha \approx 1.645/\sqrt{30} \approx 0.30$ , β ≈ 0.42, power ≈ 0.58. Underpowered by the Cohen (1988) 80% benchmark — but a very common operating point in published work.
Shrink α to 0.005 at the same (μ₁, n). c_α moves RIGHT (the red line slides into the orange density), the red shaded area collapses, the orange-hatched area BALLOONS, power drops to ≈ 0.18. The α–β trade-off in one slider move.
At α = 0.05, μ₁ = 0.5σ, slide n up to 200. Both densities sharpen (SD = 1/√200 ≈ 0.07). The red region SHRINKS, the orange-hatched region SHRINKS, power climbs to ≈ 0.99. More data is the only thing that escapes the trade-off.
Set μ₁ = 0.1σ at n = 30 (a tiny effect). Even at α = 0.20 the power is still under 50% — small effects need either huge n or a willingness to live with high α. This is the structural reason the replication crisis (§2.8) hit psychology and biomedicine hardest: typical effect sizes there are 0.2σ, typical samples are 30–100, so power was systematically around 25–35% (Button et al. 2013, Nat Rev Neurosci) — i.e., MOST published "significant" results were Type-M-error-distorted or outright false.

The Neyman–Pearson lemma — and what "most powerful" means

Given the framework, the design question becomes: among all size-α tests of $H_0$ vs $H_1$ , which one maximises power? For SIMPLE-VS-SIMPLE hypotheses ( $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$ ), the answer is the Neyman–Pearson Lemma (Neyman & Pearson 1933, Phil. Trans. R. Soc. A):

Theorem. Let $L(x \mid \theta)$ denote the likelihood. Define the likelihood ratio

\Lambda(x) = \frac{L(x \mid \theta_1)}{L(x \mid \theta_0)}.

For any constant $k > 0$ , the test that rejects $H_0$ iff $\Lambda(x) > k$ is the MOST POWERFUL test of its size — for any other size-α test $\phi'$ , the LR test has greater or equal power at $\theta_1$ . If you pick $k$ to make $P_{\theta_0}(\Lambda(X) > k) = \alpha$ , then the LR test achieves the maximum possible power among all size-α tests of $H_0$ vs $H_1$ .

Proof sketch (the classical version, see Lehmann & Romano 2005, Thm 3.2.1). Consider any other size-α test $\phi'$ . Let $\phi$ be the LR test, equal to 1 on the rejection region $R = {x : \Lambda(x) > k}$ and 0 elsewhere. On $R$ , $\phi - \phi' \ge 0$ ; off $R$ , $\phi - \phi' \le 0$ . Multiplying by $(L(x \mid \theta_1) - k L(x \mid \theta_0))$ (which is positive on R and negative off R by construction) gives $(\phi - \phi')(L_1 - k L_0) \ge 0$ pointwise. Integrating, $\int (\phi - \phi')(L_1 - k L_0), dx \ge 0$ , i.e., $(\text{power}(\phi) - \text{power}(\phi')) - k (\text{size}(\phi) - \text{size}(\phi')) \ge 0$ . Since size(φ) = size(φ') = α (the second test is also size-α), the LR test's power exceeds the alternative's. Done.

The lemma is breathtakingly clean for what it covers: simple-vs-simple, fixed sample, no nuisance parameters. The catch is that REAL hypothesis-testing problems are almost never simple-vs-simple. The null is typically a single point ("μ = 0") but the alternative is a SET ("μ > 0" or "μ ≠ 0"). Likewise, the test almost always has nuisance parameters ("unknown σ"). The lemma's power is not in its direct applicability but in the EXTENSIONS it bootstraps.

UMP tests, Karlin–Rubin, and the LRT generalisation

For a COMPOSITE alternative $H_1: \theta \in \Theta_1$ , the question is whether a single size-α test is most powerful at every $\theta \in \Theta_1$ simultaneously. A test with this property is called uniformly most powerful (UMP):

\text{power}_\phi(\theta) \;\ge\; \text{power}_{\phi'}(\theta) \quad \forall \theta \in \Theta_1, \;\forall \phi' \;\text{size-} \alpha.

UMP tests exist for SOME but not all problems. The cleanest existence result is the Karlin–Rubin Theorem (Karlin & Rubin 1956; see Lehmann-Romano 2005 §3.4): for a one-parameter exponential family with monotone likelihood ratio in a sufficient statistic $T$ , the test "reject if $T > c_\alpha$ " is UMP for any one-sided alternative $H_1: \theta > \theta_0$ against $H_0: \theta \le \theta_0$ . The one-sided z-test for a normal mean (known σ) is the canonical example; so is the one-sided test for a binomial proportion, a Poisson rate, an exponential rate parameter.

UMP tests DO NOT EXIST for most TWO-SIDED problems. For $H_0: \mu = 0$ vs $H_1: \mu \ne 0$ on normal data, no size-α test is most powerful at every nonzero μ — a test that maximises power at $\mu = -1$ (rejection region on the left tail) cannot also maximise power at $\mu = +1$ . Same for two-sided binomial, two-sided Poisson, etc. When UMP doesn't exist, two practical fallbacks:

UMP UNBIASED (UMPU). Restrict to tests whose power is everywhere ≥ α on $H_1$ (i.e., the test does at least as well as a coin flip). Within the restricted class, UMP tests often exist — the two-sided z-test is UMPU for the normal-mean problem (Lehmann-Romano 2005 §4.2).
GENERALISED LIKELIHOOD RATIO TEST (LRT). Replace the simple likelihood ratio $L(x \mid \theta_1)/L(x \mid \theta_0)$ with the ratio of MAXIMISED likelihoods over the null and the full parameter space: $\Lambda(x) = \sup_{\theta \in \Theta_0} L(x \mid \theta) ;/; \sup_{\theta \in \Theta_0 \cup \Theta_1} L(x \mid \theta)$ . Reject if $\Lambda(x)$ is small. The LRT is the universal recipe for composite-vs-composite problems and is the basis for the t-test, χ² test, F-test, and most of Part 2 §2.3. Under regularity, $-2 \log \Lambda(X)$ has an asymptotic $\chi^2_{p_1 - p_0}$ null distribution (Wilks 1938), which gives an asymptotic critical value with no further work.

The pragmatic take: for one-sided one-parameter problems, the UMP test is the answer (and it usually coincides with the obvious z- or t- or χ²-statistic). For everything else, the LRT is the default workhorse. §2.3 will work several classical tests out from the LRT machinery; for now you only need to know that an OPTIMAL test exists (UMP) for SOME problems, and a NEAR-OPTIMAL recipe (LRT) covers the rest.

The second widget makes the "UMP vs not" ordering visible. It plots three POWER FUNCTIONS $\beta(\mu)$ for tests of $H_0: \mu \le 0$ vs $H_1: \mu > 0$ on normal data, all at the same size α: the one-sided z-test (UMP for this problem by Karlin–Rubin), the sign test (which discards the magnitudes of the observations and tests only their signs, suboptimal for normal data), and the two-sided $|Z|$ -test (which wastes half its α budget on the wrong tail). The VERTICAL GAP between the curves is the power loss relative to the optimal test, in percentage points.

Things to verify:

At α = 0.05, n = 30: the z-test curve climbs from 0.05 at μ = 0 to 0.99 at μ = 1σ. The sign-test curve climbs from ~0.05 at μ = 0 to ~0.87 at μ = 1σ — visibly LOWER everywhere. This is the Pitman ARE 2/π ≈ 0.637 in pixels: the sign test would need ~ n/(2/π) ≈ n·π/2 ≈ 47 observations to match the z-test's power at n = 30.
At α = 0.05, n = 30, μ = 0.3σ (a small effect): z-test power ≈ 0.32, sign-test power ≈ 0.23, two-sided |Z|-test power ≈ 0.22. The gaps between curves are LARGEST in this intermediate-μ regime — the rank ordering is robust but the practical difference is mostly visible for moderate effect sizes.
Shrink α to 0.01. All three curves drop, but the z-test's strict-dominance ordering persists. The α–β trade-off coexists with the UMP ordering — they're orthogonal axes of test quality.
At μ = 0 (the null boundary): all curves sit at exactly α (for the continuous tests) or just below (for the discrete sign test — the exact size from the binomial floor). No test can violate its size by design; the differences live only on H₁.

Neyman–Pearson vs Fisher — and why modern practice confuses them

The framework above is purely Neyman–Pearson: pre-set α, design the rejection region, follow the decision rule. The output is a binary verdict (reject / fail to reject) and a long-run guarantee (in the imagined replication ensemble, you reject H₀ when it's true at rate α, regardless of $\theta_0$ ).

Fisher's SIGNIFICANCE TESTING is a different beast. In Fisher's framing (Statistical Methods for Research Workers, 1925), there is no formal H₁ and no pre-set α. You compute a test statistic, look up its tail probability under H₀ — what he called the p-value — and INTERPRET it as a measure of evidence against H₀. Small p-value → strong evidence against H₀; large p-value → no evidence against H₀ (notably, not the same as evidence FOR H₀). The output is a continuous strength-of-evidence number, NOT a binary decision.

The two frameworks are formally distinct and were historically at odds (the Neyman–Fisher feud is a well-documented chapter of 20th-century statistics; see Lenhard 2006 BJPS for a careful intellectual history). They differ on at least four axes:

Output type. N-P: binary decision. Fisher: continuous evidence summary.
Alternative. N-P: explicit $H_1$ , used to compute power. Fisher: no explicit $H_1$ ; the p-value depends only on $H_0$ .
Pre-set α. N-P: α is fixed in advance and protects long-run Type-I rate. Fisher: no fixed cutoff — 0.05 is a "convenient" reference, not a decision threshold.
Repetition imagined. N-P: long-run frequencies across hypothetical replications. Fisher: evidence in THIS dataset.

Modern textbook hypothesis testing is a confused HYBRID. The mechanics are Neyman–Pearson (fix α, compute a test statistic, compare to a critical value, declare reject or fail-to-reject). The INTERPRETATION is often Fisherian ("p = 0.03 is evidence against H₀"). The hybrid is internally inconsistent: if you're running a Neyman–Pearson test, the p-value's exact numerical value should not matter beyond whether $p \le \alpha$ ; if you're running a Fisher test, you shouldn't have pre-set α and you shouldn't be making binary accept/reject decisions. Hubbard & Bayarri (2003) is the definitive critique of this muddle.

§2.4 "What a p-value is — and what it is not" returns to this and dissects the consequences in detail. For §2.1 it is enough to flag the muddle: when you read "the result was significant (p < 0.05)" in a paper, you are reading a hybrid statement that conflates two frameworks. Knowing they're different is the first step to interpreting the literature honestly.

The "p < 0.05" convention — origin, arbitrariness, capture

Where did 0.05 come from? Fisher wrote in 1925 (Statistical Methods for Research Workers, ch. 3):

"The value for which P = 0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not."

That's it. A rough rule of thumb, justified by ARITHMETIC CONVENIENCE — at the era's table precision (pre-computer), the $z = 1.96 \approx 2$ rounding was a handy mental shortcut. Fisher was emphatic that 0.05 was a guideline, not a decision threshold, and explicitly recommended different thresholds in different contexts. Yet by the 1940s the convention had hardened into a near-universal cutoff, especially in psychology and biomedicine where Fisher's work was most influential.

What's arbitrary about 0.05:

Decimal artefact. 0.05 = 1/20 only because we have ten fingers. If we counted in base 12, the convention might be 1/24 ≈ 0.042. The number has no scientific or decision-theoretic content; it is a cultural artefact.
Asymmetric loss is unmodelled. The 0.05 vs 0.95 (α vs 1 − α) split treats false positives as exactly 19× more costly than false negatives. For a drug trial where a false positive sends a useless drug to market AND a false negative withholds a life-saver, the relative costs are problem-dependent and almost never 19:1.
The α-β coupling is invisible. Pre-setting α at 0.05 says nothing about power. A study with α = 0.05 and power = 0.20 (typical in some subfields, Button et al. 2013) gives p < 0.05 by truth-of-H₁ at most 20% of the time — the convention shapes the false-positive rate without addressing the much-larger false-negative rate.
Multiple comparisons silently inflate it. If you run 20 independent tests at α = 0.05, you expect ~1 false rejection from chance alone. The "p < 0.05" convention has no built-in protection against this; §2.5 will discuss FWER and FDR corrections that try to patch it.
Selective reporting amplifies it. If only "significant" results get published, the published p-value distribution is biased toward 0; the operational Type-I rate of the LITERATURE is much higher than the nominal 0.05 of the individual test. §2.6 (preregistration) and §2.8 (replication crisis) will close this loop.

What's psychologically captured about it: the binary "significant / not significant" cutoff turns a continuous evidence measure into a dichotomous accept/reject — and the dichotomous version is much easier for human cognition and journalism to process. "p < 0.05" became a publishable currency; "p = 0.06" became unpublishable, even though the underlying evidence is nearly identical. The American Statistical Association's 2016 statement on p-values (Wasserstein & Lazar 2016, Am Stat) is unusually blunt: "the widespread use of statistical significance (generally interpreted as ‘p ≤ 0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process". The 2019 follow-up (Wasserstein, Schirm & Lazar 2019, Am Stat) calls for "moving to a world beyond p < 0.05" outright.

The honest framing for §2.1: 0.05 is a convention. It is not magic. A p = 0.04 result and a p = 0.06 result carry essentially identical evidence; treating them as categorically different is a cognitive shortcut that breaks the underlying statistics. §2.4 will return to what p-values do and don't measure. Several modern guidelines (Benjamin et al. 2018 Nat Hum Behav proposed lowering to p < 0.005; Lakens et al. 2018 Nat Hum Behav instead recommended justified-α per problem) have proposed reforms; none has stuck universally. §2.1 just needs you to know the convention's origin so you can read "p < 0.05" in literature with appropriate skepticism.

Tests and confidence intervals are two views of the same machinery

One last connection before the recap: tests and confidence intervals are formally equivalent. A 100(1 − α)% confidence interval for θ is exactly the set of θ₀ values that a size-α test would NOT reject if applied as $H_0: \theta = \theta_0$ :

C(X) = \{\theta_0 : \theta_0 \text{ is not rejected by the size-}\alpha\text{ test of } H_0: \theta = \theta_0\}.

This is the test-inversion construction of a CI (see Casella-Berger 2002 §9.2.1). Reading it: "the CI is the set of nulls compatible with the data at level α". A 95% CI for μ excludes 0 if and only if the size-0.05 test of $H_0: \mu = 0$ rejects on this data. Tests and CIs are equivalent operations; the CI just packages the test's output as a SET of plausible parameter values rather than a binary verdict for a single null.

This equivalence is why Part 3 (confidence intervals) and Part 2 (testing) share so much machinery — they're the same math, organised around two different output formats. Modern recommendations (Wasserstein et al. 2019; Greenland et al. 2016) lean toward REPORTING CIs and effect-size estimates rather than just p-values, precisely because the CI carries strictly more information (every null in the complement of the CI is rejected at level α, and every null in the CI is not — so the CI summarises the entire family of tests, not just the one at $\theta_0 = 0$ ). The §2.1 framework you just built underlies both formats.

Try it

In the np-decision-regions widget, set α = 0.05, μ₁ = 0.5, n = 30 (the defaults). Read off power ≈ 0.58 from the green region area. Now compute by hand: $c_\alpha = 1.645/\sqrt{30} \approx 0.30$ ; β = Φ((0.30 − 0.5)/(1/√30)) = Φ(−1.10) ≈ 0.136. Wait — that's not right. Let me redo: $\beta = \Phi((c_\alpha - \mu_1)/(\sigma/\sqrt n)) = \Phi((0.30 - 0.50) \cdot \sqrt{30}) = \Phi(-1.10) \approx 0.136$ . So power ≈ 0.864? — That can't match the widget either. Re-examine. With σ = 1, SE = 1/√30 ≈ 0.183. $c_\alpha = 1.645 \cdot 0.183 \approx 0.301$ . $\beta = \Phi((0.301 - 0.500)/0.183) = \Phi(-1.087) \approx 0.139$ . So power ≈ 0.86 — that IS what the widget should show. Verify and reconcile. (If you got something different, you may have mis-set μ₁ to 0.5σ vs 0.5; the widget uses σ = 1 so the two coincide.)
Same widget, set α = 0.001, μ₁ = 0.5, n = 30. Compute c_α = z_{0.999}·SE = 3.090·0.183 ≈ 0.566. β = Φ((0.566 − 0.500)/0.183) = Φ(0.361) ≈ 0.641. Power ≈ 0.36. The α drop from 0.05 to 0.001 cost ~50 percentage points of power at this n. To recover, increase n.
Same widget, set α = 0.05, μ₁ = 0.5. Find the smallest n that gives power ≥ 0.80. Try n = 30 (power ≈ 0.86 — already there). Try n = 20 (power ≈ 0.71). Try n = 25 (power ≈ 0.79). Try n = 26 — about right. This is a classic SAMPLE-SIZE-FOR-POWER calculation, which §2.2 will formalise.
In the umpt-vs-not widget, set α = 0.05, n = 30, evaluation μ = 0.3. Read off z-test power ≈ 0.32, sign-test power ≈ 0.23. The gap of ~9 percentage points means using a sign test when a z-test was applicable WASTES roughly that much detection capacity. Compute the equivalent-power n for the sign test: 30/(2/π) ≈ 47. So the sign test on 47 observations matches the z-test on 30 — that's the 2/π Pitman ARE in pixels.
Same widget, drop α to 0.01. At μ = 0.5σ what is the power of each test? Notice the rank ordering (z > sign > |Z|) survives the α change; the GAPS change but the ORDERING does not, illustrating the UMP property holds at every size α.
Pen-and-paper. Prove the Neyman–Pearson Lemma for the special case of two normal alternatives: $H_0: \mu = 0$ vs $H_1: \mu = \mu_1 > 0$ on iid Normal(μ, σ²) data with σ known. Step 1: write the likelihood ratio $\Lambda(x) = \prod_i \phi(x_i - \mu_1)/\phi(x_i)$ . Step 2: take logs and simplify to show $\log \Lambda(x) = (\mu_1/\sigma^2) \sum x_i - n\mu_1^2/(2\sigma^2)$ . Step 3: conclude that {Λ > k} is equivalent to {Σ x_i > c} for some c, i.e. equivalent to {X̄ > c/n}. So the LR test reduces to a one-sided z-test on X̄. The lemma guarantees this is the most powerful size-α test of H₀ vs H₁.
Pen-and-paper. For iid Normal(μ, 1) data testing $H_0: \mu = 0$ vs $H_1: \mu \ne 0$ (two-sided composite), show that NO UMP size-α test exists. (Hint: any test most powerful at $\mu = +1$ must have its rejection region in the right tail of X̄. Any test most powerful at $\mu = -1$ must have its rejection region in the LEFT tail. A single size-α test cannot do both simultaneously.) Conclude that the two-sided z-test, which splits its α budget between the two tails, is a COMPROMISE — not optimal at either μ, but UMP within the class of UNBIASED tests (UMPU).
Pen-and-paper. Define the Pitman ARE of two tests as $\text{ARE}(T_1, T_2) = \lim_{n \to \infty} n_2(n) / n$ , where $n_2(n)$ is the sample size the second test needs to match the first test's power at the same (small) effect size and α. Look up (or derive from Pitman 1948) that ARE(sign, z) = 2/π for normal data. Interpret: to match a z-test's power, the sign test needs n × π/2 ≈ 1.57n observations. Now repeat for HEAVIER-THAN-NORMAL data (e.g., t with 3 degrees of freedom); the sign test's ARE vs the z-test can EXCEED 1 because the sign test is robust to heavy tails and the z-test is not. The optimality ordering depends on the population, not just the test.
Pen-and-paper. A drug trial pre-specifies α = 0.05 (two-sided) for a primary endpoint. The trial finds p = 0.06. The investigators say "directionally significant". A reviewer rejects the manuscript. Argue both sides. Where does the binary cutoff distort the underlying evidence? Where does the binary cutoff provide protection from p-hacking? (Connect to §2.6.)
Pen-and-paper. Construct a CONFIDENCE INTERVAL for the population mean μ from the size-α z-test on iid Normal(μ, σ²) data with σ known. Step 1: the test rejects $H_0: \mu = \mu_0$ at size α iff $|\sqrt n (\bar X - \mu_0)/\sigma| > z_{1-\alpha/2}$ . Step 2: the set of μ₀ NOT rejected is ${\mu_0 : |\sqrt n (\bar X - \mu_0)/\sigma| \le z_{1-\alpha/2}}$ . Step 3: solve for μ₀ to get $\bar X - z_{1-\alpha/2} \sigma/\sqrt n \le \mu_0 \le \bar X + z_{1-\alpha/2} \sigma/\sqrt n$ . This is the standard 100(1 − α)% Wald CI, recovered by inverting the size-α test. Part 3 will use this construction systematically.

Pause and reflect: §2.1 has framed hypothesis testing as a DECISION procedure with explicit ERROR RATES. The Neyman–Pearson lemma identifies the most-powerful test for simple-vs-simple problems; the Karlin–Rubin and LRT extensions cover most everything else. The α–β trade-off is fundamental, and more data is the only thing that escapes it. And yet the section has spent significant space ON the muddle between N-P decision-making and Fisher significance-testing, on the arbitrariness of α = 0.05, and on the cognitive capture of "significance". Why front-load the critique BEFORE doing the classical tests (§2.3) or analysing p-values formally (§2.4)? Because the math is clean and the practice is dirty — and if you only see the math you will reproduce the confused-hybrid pattern that put applied statistics where it is. The honest framework is: N-P provides a decision procedure with operating characteristics; α is a design choice not a discovery; power is the operationally useful quantity; the convention 0.05 is cultural not mathematical. §2.2 through §2.8 build out the consequences with that clear-headedness already in place.

What you now know

Hypothesis testing is a DECISION procedure, not a truth-finder. The Neyman–Pearson framework specifies: a null $H_0$ , an alternative $H_1$ , a test statistic $T(X)$ , a rejection region $R$ , and a decision rule "reject H₀ iff T(X) ∈ R". The two error rates are TYPE-I $\alpha = \sup_{H_0} P(R)$ (false positive, controlled by design) and TYPE-II $\beta(\theta) = P(\text{not } R \mid \theta), \theta \in H_1$ (false negative, a function of where the truth lies). POWER = 1 − β is the operationally useful quantity.

The Neyman–Pearson lemma gives the LIKELIHOOD-RATIO TEST as most powerful for simple-vs-simple problems. The Karlin–Rubin theorem extends this to one-sided tests in one-parameter exponential families (UMP exists). The generalised LRT covers composite-vs-composite problems by maximising the likelihood under H₀ and under H₀ ∪ H₁; under regularity, $-2 \log \Lambda \to_d \chi^2_{p_1 - p_0}$ asymptotically (Wilks 1938). The α–β trade-off is fundamental: at fixed n, shrinking α inflates β; growing n shrinks both but the coupling persists.

Two frameworks coexist confusedly in modern practice: Neyman–Pearson (pre-set α, decision, long-run error control) and Fisher significance testing (compute p-value, interpret as evidence strength, no pre-set cutoff). They differ on output type, role of $H_1$ , role of α, and frequentist interpretation. The modern textbook "reject at α = 0.05 because p < 0.05" is a hybrid that conflates them. §2.4 dissects the consequences. The 0.05 convention is a Fisher-1925 rule of thumb hardened into cultural law; its arbitrariness, its asymmetric-loss invisibility, its multiple-testing and selective-reporting vulnerabilities are the structural reasons §2.5, §2.6, and §2.8 exist.

Tests and confidence intervals are formally equivalent: a 100(1 − α)% CI is the set of $\theta_0$ NOT rejected by a size-α test. Reporting CIs and effect sizes carries strictly more information than reporting a single p-value at $\theta_0 = 0$ .

Where this lands in Part 2. §2.2 formalises POWER and SAMPLE-SIZE calculations — given (α, effect size, n), compute power; given (α, effect size, target power), compute required n. §2.3 does the t-test, χ²-test, and F-test from the LRT machinery built here. §2.4 dissects what a p-value is and is not, returning to the N-P / Fisher muddle. §2.5 handles multiple testing: FWER, FDR, and what "significant" means when you ran 20 tests. §2.6 preregistration and the garden of forking paths — the procedural fix for selective reporting. §2.7 equivalence testing (TOST), the cure for the "absence of evidence ≠ evidence of absence" fallacy. §2.8 the replication crisis — what happens when these structural failures compound across a literature. Every later section USES the framework built here; §2.1 is the load-bearing wall.

References

Neyman, J., Pearson, E.S. (1933). "On the problem of the most efficient tests of statistical hypotheses." Philosophical Transactions of the Royal Society A 231, 289–337. (The foundational paper. Introduces the lemma, the framework, the error-rate framing.)
Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh. (The book where p < 0.05 became cultural law. Chapter 3 has the famous "convenient to take this point as a limit" passage.)
Lehmann, E.L., Romano, J.P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer. (The graduate-level reference. Chapter 3 covers the N-P lemma and UMP tests; Chapter 4 covers UMPU; the Karlin-Rubin theorem is Thm 3.4.1. Chapter 13 on rank tests covers the sign-test Pitman ARE.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 8 on hypothesis testing; §8.3 develops UMP tests; §9.2 covers the test-inversion CI construction.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 10 is the cleanest one-chapter survey of testing for a working-statistician audience.)
Hubbard, R., Bayarri, M.J. (2003). "Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing." American Statistician 57(3), 171–178. (The definitive critique of the N-P/Fisher hybrid muddle. Required reading before §2.4.)
Wasserstein, R.L., Lazar, N.A. (2016). "The ASA's statement on p-values: context, process, and purpose." American Statistician 70(2), 129–133. (The 2016 ASA statement.)
Wasserstein, R.L., Schirm, A.L., Lazar, N.A. (2019). "Moving to a world beyond p < 0.05." American Statistician 73(sup1), 1–19. (The 2019 follow-up; recommends abandoning the dichotomous significance cutoff. Connects directly to §2.4 and §2.8.)
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum. (The classic on power and sample-size calculation. The 80% power benchmark is Cohen's. §2.2 will use this machinery extensively.)