Type-I, Type-II, power, and effect size

Part 2, Hypothesis testing without p-hacking

Learning objectives

Restate the Type-I rate $\alpha$ and Type-II rate $\beta(\theta)$ in §2.1 language and define POWER as $1 - \beta(\theta)$
State the four-variable closure: at a single design-stage, fixing any three of (α, β, effect size, n) determines the fourth
Compute power given (α, effect size, n) for the canonical research tests: two-sample t (balanced, equal-variance), one-sample t, paired t, two-proportion z, one-way ANOVA, regression coefficient
Invert: compute the required sample size n for a fixed (α, target power, smallest effect size of interest), the a-priori sample-size calculation that should happen BEFORE data collection
Define standardised effect sizes: Cohen's $d = (\mu_1 - \mu_2)/\sigma_p$ , Pearson's $r$ , odds ratio $OR$ , Cohen's $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$ , Cohen's $f^2 = R^2/(1-R^2)$
Recall Cohen's (1988) conventional small/medium/large benchmarks for $d$ (0.2 / 0.5 / 0.8) and STATE the caveat: the benchmarks are HEURISTICS, calibrated to behavioural psychology, not universal across domains
Distinguish A-PRIORI power analysis (compute n at the design stage, useful) from POST-HOC power analysis (compute power for the observed effect after the fact, useless, often misleading; Hoenig & Heisey 2001)
Diagnose the UNDERPOWERED-STUDY problem: when $\beta \gg \alpha$, a non-significant result is almost uninformative; "absence of evidence is not evidence of absence" (Altman & Bland 1995)
Translate among $d \leftrightarrow r \leftrightarrow OR \leftrightarrow NNT$ using the Hasselblad-Hedges / Chinn (2000) and Borenstein et al. (2009) identities
Preview sequential and Bayesian alternatives: sequential analysis (Wald 1947), Bayes factors (Kass & Raftery 1995), and decision-theoretic design

§2.1 cast hypothesis testing as a DECISION procedure with two error rates: α (Type-I, false positive, chosen by design) and β(θ) (Type-II, false negative, a function of where in $H_1$ the truth lies). Power = 1 − β is the operationally useful summary: the probability that the test correctly rejects $H_0$ when $H_1$ is true. §2.2 takes those concepts and OPERATIONALISES them. By the end of this section you should be able to walk into a research-planning meeting, ask three questions ("What effect size matters?", "What α can we live with?", "What power do we want?"), and walk out with a sample-size estimate.

The §2.2 arc has five stops. First, the FOUR-VARIABLE CLOSURE: the structural fact that (α, β, effect size, n) are linked by one equation, so fixing any three determines the fourth. Second, EFFECT SIZES: what they are, why they matter, why Cohen's small/medium/large labels are heuristics rather than universal categories, and how to convert among d, r, OR, NNT. Third, POWER CALCULATIONS for the five most-common research tests, with the noncentrality parameter as the unifying gadget. Fourth, A-PRIORI vs POST-HOC power: the first is useful, the second is mathematically suspect and operationally misleading. Fifth, the UNDERPOWERED-STUDY problem: what it costs the literature when β systematically exceeds α. Two widgets thread the section.

The four-variable closure

In §2.1 we defined the size α of a test as $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$ and the power at parameter value $\theta \in \Theta_1$ as

\text{power}(\theta) = 1 - \beta(\theta) = P(\text{reject } H_0 \mid \theta).

For a fixed test, the power is determined by three inputs:

The significance level α. Smaller α means a more conservative rejection region, lower false-positive rate, lower power at every $\theta \in \Theta_1$ .
The effect size. For testing a normal mean, the natural effect size is the standardised difference $d = (\mu_1 - \mu_0)/\sigma$ . Larger $|d|$ pushes the $H_1$ sampling distribution further from the $H_0$ distribution, so a fixed rejection boundary catches more of it.
The sample size n. Larger n sharpens both sampling distributions (their SD scales as $\sigma/\sqrt{n}$ ), reducing their overlap.

For the one-sided normal-mean problem with known σ, the explicit relationship is

1 - \beta = \Phi(d \sqrt n - z_{1 - \alpha})

where $\Phi$ is the standard-normal CDF and $z_{1-\alpha}$ is its $(1-\alpha)$ quantile. Reading: power is monotone increasing in $d$ , in $n$ , and in $\alpha$ ; given any three, the fourth is determined by the equation. This is the four-variable closure of design-stage power analysis. Each of the five research tests in this section has its own version of this formula (different noncentrality parameter, different reference distribution) but the structural fact is universal: α, β, effect size, n are four numbers tied by one equation; fix three and the fourth pops out.

The most common research use of the closure inverts the power equation: fix α (usually 0.05), fix target power (usually 0.80, following Cohen 1988), fix the SMALLEST EFFECT SIZE of practical interest (problem-specific), and solve for the required n. This is the A-PRIORI sample-size calculation that should be the FIRST quantitative step of any confirmatory study, long before data collection starts.
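To make the closure concrete, here is a minimal Python sketch of the one-sided, known-σ formula $1 - \beta = \Phi(d\sqrt n - z_{1-\alpha})$, solved for each of the four variables in turn. The function names (`power`, `required_n`, `detectable_d`, `implied_alpha`) are illustrative choices rather than part of any library; only the standard library's `statistics.NormalDist` is used.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal: Z.cdf is Phi, Z.inv_cdf is the quantile function


def power(alpha: float, d: float, n: float) -> float:
    """1 - beta = Phi(d * sqrt(n) - z_{1-alpha}); one-sided test, known sigma."""
    return Z.cdf(d * sqrt(n) - Z.inv_cdf(1 - alpha))


def required_n(alpha: float, target_power: float, d: float) -> int:
    """Smallest integer n whose power reaches the target."""
    z_sum = Z.inv_cdf(1 - alpha) + Z.inv_cdf(target_power)
    return ceil((z_sum / d) ** 2)


def detectable_d(alpha: float, target_power: float, n: float) -> float:
    """Smallest effect size that reaches the target power at this n."""
    return (Z.inv_cdf(1 - alpha) + Z.inv_cdf(target_power)) / sqrt(n)


def implied_alpha(target_power: float, d: float, n: float) -> float:
    """Significance level at which this (d, n) design has exactly the target power."""
    return 1 - Z.cdf(d * sqrt(n) - Z.inv_cdf(target_power))


# A-priori use: fix alpha, target power and the smallest effect of interest, solve for n.
n = required_n(alpha=0.05, target_power=0.80, d=0.5)
print(n, round(power(0.05, 0.5, n), 3))
```

For α = 0.05, target power 0.80 and d = 0.5, the sketch returns n = 25 with achieved power just above 0.80. The other two functions answer the less common questions: what effect can this n detect, and what α would this design need.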

Effect sizes: the standardised currency of difference

An effect size is a NUMBER that captures the magnitude of the phenomenon you care about, on a scale that is comparable across studies. "Magnitude" depends on the structure of the test, for a comparison of two means it is a difference in means; for an association it is a correlation; for a binary outcome it is an odds ratio or risk difference. Each test type has a NATURAL effect size:

Cohen's d (mean comparisons). $d = (\mu_1 - \mu_2)/\sigma_{\text{pooled}}$ , the standardised mean difference. $d = 0.5$ means the two group means are half a within-group SD apart. The natural effect size for two-sample / one-sample / paired t-tests.
Pearson's r (correlation / association). The standardised covariance $r = \text{Cov}(X, Y)/(\sigma_X \sigma_Y) \in [-1, 1]$ . For a binary group indicator, the point-biserial correlation. Natural for correlation tests and simple linear regression coefficient tests.
Odds ratio (OR) (binary outcomes). $OR = (p_1/(1-p_1))/(p_2/(1-p_2))$ . Natural for case-control studies, logistic regression coefficients, two-by-two tables. Multiplicative scale: $OR = 2$ doubles the odds.
Cohen's h (proportion comparisons). $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$, the variance-stabilising arcsine-root difference. Natural for two-proportion z-tests; $h$ is better suited to power analysis than the raw difference $p_1 - p_2$ because it stretches differences near 0 or 1 (where the binomial variance is small, so a given raw difference is easier to detect).
Cohen's f² (regression / ANOVA). $f^2 = (R^2_{\text{full}} - R^2_{\text{reduced}})/(1 - R^2_{\text{full}})$, equivalently $R^2_p/(1 - R^2_p)$ where $R^2_p$ is the partial $R^2$ for the term being tested. Cohen's $f$ is its square root. Natural for ANOVA omnibus tests and regression coefficient tests.
Hedges' g (small-sample-corrected d). $g = d \cdot J(df)$ with $J(df) \approx 1 - 3/(4 \cdot df - 1)$ . The bias-corrected Cohen's d for studies with $n \le 50$ . Used in meta-analysis (Borenstein et al. 2009).
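The definitions in this list are short enough to compute by hand, but a small sketch helps when checking numbers. The Python below implements Cohen's h, Cohen's f² and the Hedges correction exactly as written above; the function names and the example inputs are illustrative assumptions, not values from any study.

```python
from math import asin, sqrt


def cohens_h(p1: float, p2: float) -> float:
    """Arcsine-root difference between two proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def cohens_f2(r2_full: float, r2_reduced: float) -> float:
    """Variance explained by the tested term relative to residual variance."""
    return (r2_full - r2_reduced) / (1 - r2_full)


def hedges_g(d: float, df: int) -> float:
    """Small-sample correction g = d * J(df), with J(df) ~ 1 - 3 / (4 df - 1)."""
    return d * (1 - 3 / (4 * df - 1))


# The same raw difference of 0.10 maps to different h depending on where it sits.
print(round(cohens_h(0.55, 0.45), 3))   # mid-range proportions
print(round(cohens_h(0.15, 0.05), 3))   # proportions near zero
print(round(cohens_f2(0.30, 0.25), 3))  # hypothetical full vs reduced R^2
print(round(hedges_g(0.50, 18), 3))     # d = 0.5 from two groups of 10 (df = 18)
```

Running it shows a raw difference of 0.10 giving h ≈ 0.20 in the middle of the scale but h ≈ 0.34 near zero, which is the variance-stabilising behaviour the definition is designed for.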

Two crucial properties of standardised effect sizes:

Scale-free. $d$ does not depend on the unit of measurement (centimetres vs metres, USD vs EUR). This lets you compare effects across studies with different instruments.
Translatable. Under standard normal-data assumptions, the metrics convert into each other via closed-form identities. The most useful (Borenstein et al. 2009, ch. 7; Chinn 2000):

r = d / \sqrt{d^2 + 4}, \qquad d = 2r/\sqrt{1 - r^2}.

d = \log(OR) \cdot \frac{\sqrt 3}{\pi}, \qquad OR = \exp\!\left(d \cdot \frac{\pi}{\sqrt 3}\right).

(The factor $\pi/\sqrt 3$ in the d ↔ OR conversion is Chinn's (2000) reduction; it comes from the logistic-distribution variance $\pi^2/3$ used as the latent-variable scale.) These identities matter for meta-analysis (Borenstein et al. 2009), when one study reports a t-test and another a 2×2 table, you cannot average them without putting them on a common effect-size scale first.
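As a check on the identities above, here is a minimal Python sketch of the d ↔ r and d ↔ OR conversions. The function names are illustrative; the formulas are exactly the two displayed equations, so the same assumptions apply (equal group sizes for d ↔ r, a logistic latent scale for d ↔ OR).

```python
from math import exp, log, pi, sqrt


def d_to_r(d: float) -> float:
    return d / sqrt(d * d + 4)


def r_to_d(r: float) -> float:
    return 2 * r / sqrt(1 - r * r)


def or_to_d(odds_ratio: float) -> float:
    return log(odds_ratio) * sqrt(3) / pi


def d_to_or(d: float) -> float:
    return exp(d * pi / sqrt(3))


# Chain the conversions: start from r = 0.30, then from OR = 2.
d_from_r = r_to_d(0.30)
print(round(d_from_r, 2), round(d_to_or(d_from_r), 2))
d_from_or = or_to_d(2.0)
print(round(d_from_or, 2), round(d_to_r(d_from_or), 2))
```

Starting from r = 0.30 gives d ≈ 0.63 and OR ≈ 3.13; starting from OR = 2 gives d ≈ 0.38 and r ≈ 0.19. Each pair of functions is an exact inverse, so round trips return the starting value.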

Cohen's benchmarks, useful, but not universal

Cohen (1988, ch. 2; Cohen 1992 Psychol Bull) proposed CONVENTIONAL benchmarks for the size of $|d|$ :

small: $|d| \approx 0.2$ . Detectable but easy to overlook by inspection. Examples Cohen gives: average height difference between 15- and 16-year-old girls.
medium: $|d| \approx 0.5$ . Large enough to be visible to the naked eye in a scatterplot or contingency table.
large: $|d| \approx 0.8$ . Visible without statistics.

Cohen's parallel benchmarks for the other metrics: $r \approx 0.10/0.30/0.50$; $h \approx 0.20/0.50/0.80$; $f \approx 0.10/0.25/0.40$. (These are separate conventions per metric, not exact conversions of one another.) Cohen (1988) was explicit that these were ROUGH HEURISTICS to anchor planning when the researcher had no domain-specific data on typical effect sizes, not labels to slap on findings. The labels nonetheless hardened into the literature as a kind of universal grading scale.

The benchmarks are DOMAIN-DEPENDENT and should be calibrated to the field:

Behavioural psychology / social. After publication-bias correction, the median replicated effect is $d \approx 0.2$ (Open Science Collaboration 2015 Science). Cohen's "medium" $d = 0.5$ is RARE in well-conducted psychology; Funder & Ozer (2019) argue Cohen's benchmarks are unrealistically high for the field and propose $r = 0.05/0.10/0.20/0.30$ (very small / small / medium / large) as more defensible thresholds for typical social-science work.
Clinical medicine. RCT effects for non-life-saving interventions cluster around $d = 0.20 - 0.30$ (Pereira & Ioannidis 2011 PLoS ONE). Effects of $d = 0.5$ are exceptional; effects of $d = 0.10$ are still clinically important when multiplied across a population.
Education. Hattie (2009) Visible Learning synthesised more than 800 meta-analyses (covering roughly 50,000 studies) and reported an average educational-intervention effect of $d \approx 0.40$, which he adopted as the "hinge point" for distinguishing effective from ineffective interventions. The number is widely used and widely criticised (the meta-meta-analysis collapses heterogeneous interventions).
Sociology / survey. Survey effects in absolute $d$ are typically small (0.05 to 0.20) but well-estimated due to large $n$. Funder & Ozer's (2019) "small" benchmark of $r = 0.10$ (about $d = 0.20$ by the d ↔ r identity above) is a common reference point for "small but consequential" effects.
Particle physics / genome-wide association. The "large" effect for a single SNP in a GWAS is $OR \approx 1.1 \Leftrightarrow d \approx 0.05$ . Particle-physics analyses routinely target effects of $d \approx 0.001$ . "Large" is a function of the noise floor.

The first widget makes this explicit. Pick a domain, see the empirical distribution of effect sizes, and locate your design's effect on it.

Things to do in the widget:

Set d = 0.5 (Cohen's "medium"). Switch the domain to Medicine. Notice that d = 0.5 is ABOVE the 75th percentile of typical medical-RCT effects, it would be an unusually large effect in that domain, not a routine one.
Set d = 0.20. Switch among the four domains. In Psychology, this is the median (P50) effect; in Sociology, it's in the upper quartile; in Education, it's below the 25th percentile. The same number means very different things.
Use the translator panel: enter r = 0.30. Read off OR ≈ 3.13, d ≈ 0.63, NNT ≈ 3.6 at CER = 30%. The same "medium" r corresponds to a "large-ish" d and a clinically very dramatic NNT; the choice of metric shapes the rhetorical impact of a finding.
Enter OR = 2 (an effect routinely reported as "doubling the odds"). Read d ≈ 0.38, r ≈ 0.19, NNT ≈ 8.5. An OR of 2 sounds large; the equivalent d is between Cohen's small and medium and the NNT (need to treat 8-9 people for one extra outcome) is a moderate clinical benefit.

The five canonical research tests and their power formulas

Each of the standard research tests gives a closed-form (or near-closed-form) power expression, parameterised by the noncentrality of the test statistic's sampling distribution under $H_1$ . The unifying idea: under $H_0$ the test statistic follows a CENTRAL reference distribution (central t, central F, standard normal); under $H_1$ it follows a NONCENTRAL version of the same distribution, shifted by a noncentrality parameter $\lambda$ that depends on the effect size and the sample size. Power is then the tail probability of the noncentral distribution past the critical value of the central one.

Two-sample t-test (pooled-variance Student's form, balanced). Compare means of two independent groups, each of size $n$, assumed normal with equal SD $\sigma$. (With equal $n$, Welch's statistic coincides with the pooled one; only its estimated degrees of freedom differ.) Effect size: Cohen's $d = (\mu_1 - \mu_2)/\sigma$. Test statistic $T$ has $\nu = 2(n-1)$ degrees of freedom under $H_0$ and noncentrality $\lambda = d \sqrt{n/2}$ under $H_1$. Power = $P(T_{\nu, \lambda} > t_{1-\alpha/2, \nu}) + P(T_{\nu, \lambda} < -t_{1-\alpha/2, \nu})$ (two-tailed). Quick rule of thumb (often called Lehr's rule): n ≈ 16/d² per group for 80% power at α = 0.05 (two-tailed).
One-sample t-test. Compare a sample mean to a hypothesised value $\mu_0$ . $d = (\mu - \mu_0)/\sigma$ . $\nu = n - 1$ , $\lambda = d \sqrt n$ . Power formula same as above with the new $(\nu, \lambda)$ . Quick rule: n ≈ 8/d² for 80% power.
Paired t-test. Compare two paired/within-subject measurements. Reduces to a one-sample t-test on the differences. Effect size $d_z = \text{mean}(\text{diff})/\text{SD}(\text{diff})$. NB under equal variances and within-pair correlation $\rho$, $\text{SD}(\text{diff}) = \sigma\sqrt{2(1-\rho)}$, so $d_z = d/\sqrt{2(1-\rho)}$: $d_z$ exceeds the corresponding between-subjects $d$ whenever $\rho > 0.5$, which is common for repeated measures on the same subject, so pairing typically buys power.
Two-proportion z-test. Compare two binomial rates $p_1, p_2$ from balanced groups of size $n$ each. Effect size: Cohen's $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$ (variance-stabilising). Under $H_1$ the test statistic is approximately $Z \sim N(h \sqrt{n/2}, 1)$ . Power = $\Phi(h\sqrt{n/2} - z_{1-\alpha/2}) + \Phi(-h\sqrt{n/2} - z_{1-\alpha/2})$ .
One-way ANOVA. Compare means of $k$ groups, each of size $n$, total $N = kn$. Effect size: Cohen's $f = \sigma_{\text{between}}/\sigma_{\text{within}}$. Under $H_1$ the F-statistic follows a noncentral F with $\nu_1 = k - 1$, $\nu_2 = N - k$ and noncentrality $\lambda = N f^2$ (Cohen 1988 §8.2.1). Power = $P(F_{\nu_1, \nu_2, \lambda} > F_{1-\alpha, \nu_1, \nu_2})$. Quick rule: for the omnibus test at α = 0.05, power = 0.80, k = 3 groups, the required noncentrality is $\lambda \approx 9.6$, so $N \approx 9.6/f^2$ in total; at $f = 0.25$ that is $N \approx 154$, or about 52 per group (Cohen 1992 Table 2 gives 52).
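The t- and z-test entries in this list share one computation: a noncentrality parameter and a two-tailed rejection probability. The sketch below implements that in Python using a NORMAL APPROXIMATION to the noncentral t (an assumption made so that only the standard library is needed); the function names are illustrative. For the two-proportion z-test the approximation is the formula as stated; for the t-tests it slightly understates the required n at small samples. The ANOVA case needs a noncentral F routine and is not covered here.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()


def two_sided_power(ncp: float, alpha: float = 0.05) -> float:
    """P(|Z'| > z_{1-alpha/2}) for Z' ~ N(ncp, 1): sum of both rejection tails."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(ncp - z_crit) + Z.cdf(-ncp - z_crit)


def power_two_groups(effect: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Two-sample t (effect = d) or two-proportion z (effect = h): ncp = effect * sqrt(n/2)."""
    return two_sided_power(effect * sqrt(n_per_group / 2), alpha)


def power_one_sample(d: float, n: int, alpha: float = 0.05) -> float:
    """One-sample t, or paired t with d_z: ncp = d * sqrt(n)."""
    return two_sided_power(d * sqrt(n), alpha)


def n_per_group(effect: float, alpha: float = 0.05, target: float = 0.80) -> int:
    """Per-group n for two balanced groups, ignoring the negligible far tail."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(target)
    return ceil(2 * (z_sum / effect) ** 2)


for effect in (0.2, 0.5, 0.8):
    n = n_per_group(effect)
    print(effect, n, round(power_two_groups(effect, n), 3))
```

Under the normal approximation the required per-group n comes out at 393, 63 and 25 for effects of 0.2, 0.5 and 0.8, within one of the t-based values in Cohen's tables. An exact calculation would replace the normal tails with a noncentral t distribution, for example via SciPy's `scipy.stats.nct`.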

The widget below operationalises all five. Pick a test, set α, set the effect size, set $n$ , and read off the power AND the required $n$ for the Cohen-standard 80% target.

Reproducing Cohen (1992) Table 2 in the widget:

Two-sample t, α = 0.05 (two-tailed), d = 0.5, target power = 0.80. The widget reports required n ≈ 64 per group. Cohen 1992 Table 2 gives 64. ✓
Two-sample t, α = 0.05, d = 0.2 (small), target power = 0.80. Required n ≈ 393 per group. Cohen 1992: 393. ✓
Two-sample t, α = 0.05, d = 0.8 (large), target power = 0.80. Required n ≈ 26 per group. Cohen: 26. ✓
One-way ANOVA, α = 0.05, k = 4 groups, f = 0.25 (medium), target power = 0.80. Required n ≈ 45 per group. Cohen: 45. ✓
Two-proportion z, α = 0.05, h = 0.20 (small), target power = 0.80. Required n ≈ 392 per group. Cohen: 392. ✓

Things to verify in the power-calculator:

Power is monotone in n. Hold (α, d) fixed at (0.05, 0.50). Slide n from 10 to 1000, power climbs monotonically from ~0.18 to ~0.99 for the two-sample t.
The α-β trade-off persists at fixed n. Hold (d, n) fixed at (0.50, 30). Slide α from 0.10 to 0.001: power drops from ~0.61 to ~0.07. This is the α-β coupling from §2.1 made operational at the design stage.
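Effect size dominates n. Compare (d = 0.20, n = 30) → power ~0.12 vs (d = 0.80, n = 30) → power ~0.87. Because $\lambda = d\sqrt{n/2}$, a 4× increase in effect size multiplies the noncentrality by 4, whereas a 4× increase in n only doubles it: (d = 0.20, n = 120) reaches power of only ~0.34, and matching the 4× larger effect takes 16× the sample.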
ANOVA omnibus needs more total N than the equivalent pairwise t. ANOVA with k = 4 groups at f = 0.25 needs n ≈ 45 per group (N = 180). A pairwise t-test of one comparison at d = 0.50 needs n ≈ 64 per group (N = 128). Multiple groups dilute power; this is one structural reason for the multiple-comparisons problem §2.5 will tackle.
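The checks in this list can also be run outside the widget. The sketch below uses a normal approximation to the two-sample t (an assumption, so the numbers differ from the t-based widget values in the second decimal) to compare the leverage of effect size against sample size; `power_two_sample` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def power_two_sample(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Two-tailed power with noncentrality d * sqrt(n/2), normal approximation."""
    ncp = d * sqrt(n_per_group / 2)
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(ncp - z_crit) + Z.cdf(-ncp - z_crit)


baseline = power_two_sample(0.20, 30)
four_times_d = power_two_sample(0.80, 30)        # noncentrality x 4
four_times_n = power_two_sample(0.20, 120)       # noncentrality x 2
sixteen_times_n = power_two_sample(0.20, 480)    # noncentrality x 4 again

print(round(baseline, 2), round(four_times_d, 2),
      round(four_times_n, 2), round(sixteen_times_n, 2))

# The alpha sweep at fixed (d, n) = (0.50, 30).
for alpha in (0.10, 0.05, 0.01, 0.001):
    print(alpha, round(power_two_sample(0.50, 30, alpha), 2))
```

The first line of output shows that quadrupling d and multiplying n by sixteen land on the same power, because both quadruple the noncentrality, while quadrupling n alone falls well short.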

A-priori vs post-hoc power, and why post-hoc is broken

Power calculations come in two flavours, and only one is methodologically sound:

A-priori (design-stage) power analysis. Performed BEFORE data collection. Input: target power (e.g. 0.80), α (e.g. 0.05), smallest effect size of practical interest (problem-specific, justified by domain knowledge or pilot data). Output: required n. This is the legitimate use case. Every confirmatory study should have one. Funders and IRBs increasingly require it.
Post-hoc (observed-effect) power analysis. Performed AFTER data collection, using the OBSERVED effect size as the input. Output: "the power of my study at the effect size I observed." This is methodologically broken (Hoenig & Heisey 2001 Am Stat).

Why post-hoc power is broken: the observed effect size is a random variable, and the observed power is a deterministic function of the p-value. Hoenig & Heisey (2001) show that, for a given test and α, post-hoc power and the p-value are MONOTONICALLY RELATED: the smaller the p-value, the larger the observed power, and a p-value of exactly 0.05 corresponds to a post-hoc power of approximately 0.50 (essentially exact for a z-test, close for t-tests at moderate n). So reporting "observed power = 0.50" adds NO information beyond reporting "p = 0.05." The post-hoc power calculation is a redundant restatement of the p-value, dressed up as a power analysis.
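The redundancy is easy to see in code. The sketch below, for a two-sided z-test (an assumption chosen so the mapping is exact and needs only the standard library), computes observed power from the p-value alone, with no reference to the data, effect size, or sample size. The function name is illustrative.

```python
from statistics import NormalDist

Z = NormalDist()


def observed_power_from_p(p_value: float, alpha: float = 0.05) -> float:
    """Plug the observed |z| back in as if it were the true noncentrality."""
    z_obs = Z.inv_cdf(1 - p_value / 2)   # the |z| that produced this two-sided p
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(z_obs - z_crit) + Z.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.21, 0.50):
    print(p, round(observed_power_from_p(p), 2))
```

The function takes nothing but p and α, which is the point: whatever the study, p = 0.05 maps to observed power of about 0.50, smaller p maps to higher observed power, and larger p maps to lower.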

The fallacy worsens when post-hoc power is invoked to defend a non-significant result: "my study had post-hoc power = 0.30, so a non-significant result is acceptable." This is circular, the low post-hoc power is just another way of saying the p-value was large. It does not address the relevant counterfactual question: "if the true effect had been the MINIMUM-OF-INTEREST size, would my study have detected it?" That requires an A-PRIORI calculation with a pre-specified minimum-of-interest, not the OBSERVED effect.

The honest framing post-hoc: report the confidence interval for the effect size. The CI directly says which effect sizes are compatible with the data, much more information than a single p-value or a single observed-power number. §2.7 (equivalence testing / TOST) provides the right statistical machinery when the research question is "is the effect negligibly small?".
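A minimal sketch of that confidence-interval report for a two-group Cohen's d follows. It ASSUMES the large-sample variance approximation commonly used in meta-analysis, $\text{Var}(d) \approx (n_1 + n_2)/(n_1 n_2) + d^2/(2(n_1 + n_2))$, and a normal reference distribution; the function name and the example inputs (d̂ = 0.40 with 20 per group) are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def d_confidence_interval(d_hat: float, n1: int, n2: int, level: float = 0.95):
    """Approximate CI for Cohen's d from two independent groups."""
    se = sqrt((n1 + n2) / (n1 * n2) + d_hat ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d_hat - z * se, d_hat + z * se


low, high = d_confidence_interval(0.40, 20, 20)
print(round(low, 2), round(high, 2))
```

With 20 per group the interval runs from roughly −0.23 to 1.03: the data are compatible with no effect and with a very large one, which is a far more honest summary than either "non-significant" or an observed-power figure.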

The underpowered-study problem

An underpowered study is one designed (or accidentally configured) with $\beta$ much larger than $\alpha$. Concretely: $\alpha = 0.05$ but power = 0.30, so $\beta = 0.70$. Such a study's Type-II error rate is 14× its Type-I error rate. Several consequences cascade through the literature:

Most published findings from underpowered fields are unreliable. Button et al. (2013) Nat Rev Neurosci estimated typical neuroscience studies have power around 21%. Combined with publication selection (only significant results get published), the positive-predictive value of a "significant" finding from such a literature is much LOWER than the 95% that a naive reading of α = 0.05 suggests; Button et al. estimated PPV ≈ 0.36 for typical neuroscience. The famous Ioannidis (2005) "Why most published research findings are false" paper makes the same argument in greater generality.
Effect-size estimates are INFLATED by selection. If only significant results get published, then conditional on significance the published effect size is biased upward, the "winner's curse" or Type-M (magnitude) error of Gelman & Carlin (2014) Perspect Psychol Sci. The published d = 0.5 from an underpowered study is, in expectation, much larger than the true effect; the true effect could be d = 0.1 or 0.2, with the published number drawn from the tail.
Replications fail systematically. The Reproducibility Project: Psychology (Open Science Collaboration 2015 Science) replicated 100 published psychology findings and found a median replicated effect roughly half the original, exactly the Type-M-inflation signature.
The wrong direction. Gelman & Carlin (2014) also flag the Type-S (sign) error: in low-power studies with selection, even the sign of the published effect can be wrong with non-trivial probability. A "significant" result of d = +0.3 in an underpowered study might correspond to a true effect of d = -0.1 (small, opposite direction) with probability 5-10%.
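The Type-M and Type-S quantities in this list can be computed in closed form under a simple model. The sketch below ASSUMES the estimate is normally distributed around a positive true effect with a known standard error, and uses truncated-normal means; it is in the spirit of Gelman & Carlin's design analysis but is not their code, and the function name and example design are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def design_errors(true_effect: float, se: float, alpha: float = 0.05):
    """Return (power, Type-S rate, expected exaggeration ratio) for a two-sided test."""
    if true_effect <= 0 or se <= 0:
        raise ValueError("this sketch assumes a positive true effect and standard error")
    z_crit = Z.inv_cdf(1 - alpha / 2)
    mu = true_effect / se                      # true effect in standard-error units
    p_right = 1 - Z.cdf(z_crit - mu)           # significant, correct sign
    p_wrong = Z.cdf(-z_crit - mu)              # significant, wrong sign
    power = p_right + p_wrong
    type_s = p_wrong / power                   # P(wrong sign | significant)
    # E[|estimate| ; significant] in SE units, from truncated-normal means.
    mean_abs = (mu * p_right + Z.pdf(z_crit - mu)) + (-mu * p_wrong + Z.pdf(-z_crit - mu))
    type_m = mean_abs / (power * mu)           # E[|estimate| | significant] / true effect
    return power, type_s, type_m


# Hypothetical design: true d = 0.2, n = 20 per group, so SE(d) is roughly sqrt(2/20).
power, type_s, type_m = design_errors(0.2, sqrt(2 / 20))
print(round(power, 2), round(type_s, 2), round(type_m, 1))
```

For this design power is about 0.10, roughly 5% of significant results have the wrong sign, and the significant estimates overstate the true effect by a factor of nearly four on average.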

What an honest researcher does: do an A-PRIORI power calculation; if the required n is infeasible at the chosen effect size, REDESIGN: pick a within-subject design instead of between-subjects, recruit collaborators for a multi-site replication, use a more sensitive outcome measure, or admit the study cannot answer the question and propose a different one. Knowingly running an underpowered confirmatory study is increasingly treated as a questionable research practice, and sequential designs offer one route to high power at lower expected cost (Lakens 2014 Eur J Soc Psychol); even where norms are laxer, the cost is real, because with β far above α a true effect is more likely to be missed than found.

The HONEST report after a non-significant result: "the 95% CI for the effect was [-0.05, 0.40], so effects as large as 0.40 are still consistent with the data; we cannot rule out effects smaller than 0.10 either; the study is INCONCLUSIVE about effects of practically meaningful magnitude." That is much more informative than "p = 0.18, no effect."

Bayesian and sequential alternatives

The N-P / Cohen power machinery is one approach to design. Three alternatives appear in modern practice:

Sequential analysis (Wald 1947; Lakens 2014). Plan a study that stops as soon as a pre-specified evidence threshold is reached, with appropriate α-spending so that the overall Type-I rate is controlled. Useful when data accrue slowly and the cost of running the study to a fixed n is high. The pre-registered stopping rule is critical: "checking p every 10 subjects and stopping when p < 0.05" without α-spending can inflate the Type-I rate to 30-40% when the number of looks is large (Armitage et al. 1969).
Bayes factors (Kass & Raftery 1995; Rouder et al. 2009). Replace the binary reject/fail-to-reject with a continuous evidence ratio $BF_{10} = P(\text{data} \mid H_1)/P(\text{data} \mid H_0)$ . Pre-specify a target BF (e.g. 10) as the evidence threshold. Bayes factors automatically incorporate prior information about plausible effect sizes, Cohen's d-benchmarks would enter the prior, not a separate effect-size justification.
Decision-theoretic / loss-based design. Replace the implicit symmetric loss of the α = 0.05 convention with an explicit utility function over decisions and consequences. Optimal n is the one that maximises expected utility; sometimes the answer is a much smaller study, sometimes a much larger one, depending on the relative costs of false positives vs false negatives vs running the study at all. Berger (1985) Statistical Decision Theory and Bayesian Analysis is the textbook reference.
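The Type-I inflation from uncorrected peeking, mentioned under sequential analysis, is easy to reproduce by simulation. The sketch below ASSUMES a one-sample z-test with known σ = 1 and a true null, and checks significance every `look_every` observations; the function name, the default settings and the seed are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_type_i_rate(n_max: int = 100, look_every: int = 10,
                        alpha: float = 0.05, n_sims: int = 4000, seed: int = 0) -> float:
    """Share of null simulations that hit p < alpha at ANY interim look."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        total = 0.0
        for i in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)          # H0 is true: mean 0, sigma 1
            if i % look_every == 0 and abs(total / sqrt(i)) > z_crit:
                false_positives += 1              # stop at the first "significant" look
                break
    return false_positives / n_sims


print("10 looks:", peeking_type_i_rate(look_every=10))
print("1 look:  ", peeking_type_i_rate(look_every=100))
```

With a single look at n = 100 the rejection rate stays near the nominal 0.05; with ten uncorrected looks it rises to several times that, and it keeps climbing as the number of looks grows. An α-spending rule would replace the fixed `z_crit` with a stricter, look-specific boundary.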

None of these supersedes the §2.2 machinery; they extend it. The Cohen-style power calculation remains the lingua franca of research design and the entry point that funders, IRBs, and most journals expect.

Try it

In the power-calculator, set test = two-sample t, α = 0.05, d = 0.50, n = 30 per group. Read off power. (Should be ≈ 0.48.) Now read off required n for 80% power. (Should be ≈ 64.) Verify against Cohen 1992 Table 2 row d = 0.50.
Same widget, switch to one-sample t at the same d = 0.50, α = 0.05. Required n for 80% power should be about HALF the two-sample answer (~ 34). Explain why a one-sample t needs roughly half the per-group n of a two-sample t at the same d: the one-sample noncentrality $d\sqrt n$ at n = 34 ≈ the two-sample $d\sqrt{n/2}$ at n = 64 ≈ 2.9; both yield power ≈ 0.80.
Same widget, ANOVA with k = 4 groups, α = 0.05, f = 0.25 (medium). Required n per group for 80% power should be ≈ 45 (total N ≈ 180). Compare to a pairwise t at d = 0.50 (equivalent to f for k = 2): per-group ≈ 64 (total N ≈ 128). ANOVA needs LARGER total N because it splits power across more groups.
Same widget, two-proportion z. Set p₁ = 0.40, p₂ = 0.50 (in spirit; the widget takes Cohen's h directly, and here |h| ≈ 0.20). At α = 0.05, h = 0.20, target power = 0.80, the required n per group is ≈ 392 (Cohen 1992 Table 2). Now set h = 0.10 (half of Cohen's "small"): required n jumps to ≈ 1570 per group. Halving the effect quadruples the required n: the $\propto 1/h^2$ scaling at fixed power.
In the effect-size-translator, set d = 0.30, domain = Education. Read off the domain percentile. (P25 ≈ 0.25, P50 ≈ 0.40, so d = 0.30 is in the lower-quartile-to-median range, common in education.) Now switch domain to Medicine: same d = 0.30 is between P50 and P75, above-average for clinical RCTs.
Translator: set OR = 1.5. Read d (≈ 0.22), r (≈ 0.11), NNT at 30% CER (≈ 18). "OR = 1.5" sounds notable but is Cohen's small d.
Pen-and-paper. Derive the n ≈ 16/d² rule of thumb for two-sample t-tests. Use the normal approximation: power ≈ Φ( $d\sqrt{n/2} - z_{1-\alpha/2}$ ). Set α = 0.05 (two-tailed, z = 1.96), target power = 0.80 (z_β = 0.84). Solve $d\sqrt{n/2} - 1.96 = 0.84$ → $d\sqrt{n/2} = 2.80$ → $n = 2 \cdot (2.8/d)^2 = 15.7/d^2 \approx 16/d^2$ . ✓
Pen-and-paper. A pilot study with n = 20 per group finds d̂ = 0.40 with p = 0.21 (non-significant). Compute the OBSERVED post-hoc power (with d̂ = 0.40, n = 20) in the power-calculator. It should come out well below 0.50 (roughly 0.23), as the p-value ↔ observed-power relation predicts for any p > 0.05. Now use a-priori reasoning: if the MINIMUM effect of practical interest is d = 0.30, what n would have been required? (Roughly 175 to 180 per group by the 16/d² rule.) Compare to the n = 20 you had. The a-priori calculation tells you what the pilot was actually powered to detect.
Pen-and-paper. A drug trial pre-registers α = 0.025, target power = 0.90, smallest clinically important effect d = 0.20. Required n per group? Use the rule of thumb adapted: n ≈ $2 (z_{1-\alpha/2} + z_{1-\beta})^2 / d^2 = 2 (2.24 + 1.28)^2 / 0.04 = 2 \cdot 12.4 / 0.04 \approx 620$. Verify in the widget. With a budget for n = 200 per group, what power do you actually have? (≈ 0.40, since $d\sqrt{n/2} = 2.0$ falls short of the critical value 2.24; less than half of what was specified.)
Pen-and-paper. Open Science Collaboration (2015) found that of 100 psychology studies replicated, the median replication effect was about half the original effect. Argue why this is a SYSTEMATIC consequence of underpowered studies + publication selection (Type-M error of Gelman & Carlin 2014), not a moral failing of the original researchers. Then identify the design-stage fix.

Pause and reflect: §2.2 has converted §2.1's α-and-β concepts into actionable research-planning numbers. The four-variable closure says you cannot choose α, β, effect size, and n independently, once you fix three, the fourth is determined. The Cohen d benchmarks are useful heuristics but domain-dependent, "medium" in education is not "medium" in medicine, and treating the benchmarks as universal labels has contributed to a generation of underpowered psychology studies. A-priori power analysis is the legitimate use case; post-hoc power is a redundant restatement of the p-value and should not enter a discussion of a non-significant result. The next section, §2.3, descends from these abstractions to the actual mechanics of computing the t-, χ²-, and F-statistics by hand for the canonical research tests.

What you now know

Power = $1 - \beta$ is the probability a test rejects $H_0$ when $H_1$ is the truth at a given effect size. The four-variable closure says (α, β, effect size, n) are tied by one equation; fix three, the fourth is determined. A-priori power analysis fixes α, target power, and smallest-effect-of-interest, and solves for required n, the foundational step of any confirmatory study design.

Effect sizes are standardised currencies of difference. Cohen's $d$ for mean comparisons, Pearson's $r$ for associations, odds ratio $OR$ for binary outcomes, Cohen's $h$ for proportions, Cohen's $f^2$ for regression. They translate via closed-form identities (Borenstein et al. 2009; Chinn 2000). Cohen's (1988) small/medium/large benchmarks (d = 0.2/0.5/0.8) are HEURISTICS calibrated to behavioural psychology, empirical effect-size distributions vary widely by domain, and the benchmarks should be re-calibrated to the field (Funder & Ozer 2019).

The five canonical research tests (two-sample t, one-sample t, paired t, two-proportion z, one-way ANOVA) all share a noncentrality-parameter structure: under $H_0$ the test statistic follows a central reference distribution, under $H_1$ a noncentral version of the same. Power is the noncentral-distribution tail past the critical value. Quick rules: two-sample t needs $n \approx 16/d^2$ per group for 80% power at α = 0.05; the ANOVA omnibus test with k = 3 groups needs $N \approx 9.6/f^2$ in total.

Post-hoc power computed with the observed effect is mathematically a restatement of the p-value (Hoenig & Heisey 2001) and adds no information. Reporting the confidence interval for the effect is the right way to characterise a non-significant result. Underpowered studies cause $\beta \gg \alpha$ , inflate published effect sizes (Type-M error; Gelman & Carlin 2014), and produce the replication failures documented in Open Science Collaboration (2015) and Button et al. (2013).

Where this lands in Part 2. §2.3 computes the t-, χ²-, and F-statistics by hand for the canonical research tests, mapping the abstract power formulas of §2.2 onto specific test mechanics. §2.4 dissects what a p-value is and is not, connecting the N-P / Fisher muddle from §2.1 to the post-hoc-power abuse covered here. §2.5 handles multiple testing, when an ANOVA flags 4 groups as different and you do 6 pairwise comparisons, the family-wise error rate explodes; FWER and FDR corrections are the patches. §2.6 preregistration: codify the a-priori power analysis and the analysis plan BEFORE data collection. §2.7 equivalence testing: the proper machinery for asserting absence of an effect of meaningful size. §2.8 the replication crisis: the literature-scale consequences when these tools are misused for a generation.

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum. (The foundational reference. Chapter 2 introduces the small/medium/large d-benchmarks; chapter 8 covers ANOVA power with Cohen's f; appendix tables give required n for every test.)
Cohen, J. (1992). "A power primer." Psychological Bulletin 112(1), 155-159. (The short version: tables of required n for each common test at α = 0.05 and target power 0.80, indexed by effect size. The widget reproduces these numbers.)
Faul, F., Erdfelder, E., Lang, A.-G., Buchner, A. (2007). "G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences." Behavior Research Methods 39(2), 175-191. (The standard software implementation of Cohen-style power analysis; the widget covers the same calculations a G*Power user would run for the basic tests.)
Hoenig, J.M., Heisey, D.M. (2001). "The abuse of power: The pervasive fallacy of power calculations for data analysis." American Statistician 55(1), 19-24. (The definitive critique of post-hoc / observed-effect power analysis. Required reading before invoking power to defend a non-significant result.)
Button, K.S., Ioannidis, J.P.A., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S.J., Munafò, M.R. (2013). "Power failure: why small sample size undermines the reliability of neuroscience." Nature Reviews Neuroscience 14(5), 365-376. (Median estimated power in neuroscience ≈ 21%; positive predictive value of a significant finding in that literature is correspondingly poor.)
Gelman, A., Carlin, J. (2014). "Beyond power calculations: Assessing Type-S (sign) and Type-M (magnitude) errors." Perspectives on Psychological Science 9(6), 641-651. (Introduces Type-S and Type-M errors. Critical companion to Hoenig & Heisey for analysing what underpowered studies actually produce.)
Open Science Collaboration (2015). "Estimating the reproducibility of psychological science." Science 349(6251), aac4716. (The 100-replication psychology study. Median replicated effect ≈ half the original, the Type-M signature of underpowered + selection.)
Funder, D.C., Ozer, D.J. (2019). "Evaluating effect size in psychological research: Sense and nonsense." Advances in Methods and Practices in Psychological Science 2(2), 156-168. (Argues Cohen's benchmarks are unrealistic for psychology and proposes r = 0.05/0.10/0.20/0.30 as very small/small/medium/large. Empirical justification for the domain-dependent caveat.)
Borenstein, M., Hedges, L.V., Higgins, J.P.T., Rothstein, H.R. (2009). Introduction to Meta-Analysis. Wiley. (The standard reference for effect-size conversions, including the d ↔ r identity and small-sample Hedges' g correction. Used internally by the translator widget.)
Chinn, S. (2000). "A simple method for converting an odds ratio to effect size for use in meta-analysis." Statistics in Medicine 19(22), 3127-3131. (The d = log(OR)·√3/π identity used in the widget's metric translator.)
Lehmann, E.L., Romano, J.P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer. (Chapter 5 covers the t-test power calculation via the noncentral t-distribution; chapter 7 covers F-test power via noncentral F. The theoretical underpinning of every formula in this section.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (§8.3 develops the power function in the textbook idiom used here. Excellent end-of-chapter exercises on power calculations.)
Wasserman, L. (2004). All of Statistics. Springer. (Chapter 10 §10.4 on power; chapter 11 on Bayesian inference for the Bayesian-alternatives section.)