Equivalence testing and TOST

Part 2 — Hypothesis testing without p-hacking

Learning objectives

State the ASYMMETRY of classical NHST: a non-significant p-value does NOT support H₀; failure to reject is NOT evidence for the null. 'Absence of evidence is not evidence of absence' (Altman & Bland 1995, BMJ)
Name the three classes of research question that REQUIRE supporting the null rather than rejecting it: (i) bioequivalence trials (generic drug vs brand), (ii) non-inferiority trials (new treatment ≥ old treatment by some margin), (iii) replication studies where the goal is to demonstrate effect is small enough to be ignorable
Define the EQUIVALENCE MARGIN δ: the symmetric interval [−δ, +δ] of mean differences that are 'practically equivalent' for the purpose at hand. δ must be PRE-SPECIFIED before the data, based on subject-matter knowledge — not extracted from the data
State the TOST (Two One-Sided Tests) procedure of Schuirmann (1987): test H₀: |μ₁ − μ₂| ≥ δ vs H₁: |μ₁ − μ₂| < δ as the conjunction of TWO one-sided tests at α — (i) is μ₁ − μ₂ > −δ? (ii) is μ₁ − μ₂ < +δ? — and conclude equivalence iff BOTH reject
Derive the WESTLAKE (1976) / BERGER-HSU (1996) CI equivalence: TOST rejects H₀ at level α iff the 100(1 − 2α)% confidence interval for μ₁ − μ₂ lies entirely inside (−δ, +δ). The CI must FIT INSIDE the margin
Explain why TOST controls Type-I at α (not 2α): rejecting the equivalence-null requires BOTH one-sided tests to reject, which is an intersection-union construction; the worst-case Type-I rate over the composite null is α (Berger & Hsu 1996, Stat Sci)
Compute the FDA bioequivalence margin: log(AUC) ratio within log(0.80, 1.25), i.e., the equivalence margin is ±log(1.25) ≈ ±0.223 on the log scale (FDA 2003 guidance, §III); this is the canonical worked δ in industry
Distinguish EQUIVALENCE (symmetric: |Δ| < δ) from NON-INFERIORITY (one-sided: Δ > −δ — new at least as good as old, less the margin); both are TOST-family but non-inferiority drops the +δ side
Use POWER for equivalence testing: power depends on the true Δ (closer to 0 = more power), σ, n, δ, and α. Larger δ = more power. The classical Schuirmann (1987) power formula uses the non-central t with non-centrality parameters at ±δ/se
Recognise BAYESIAN alternatives: ROPE-vs-posterior (Kruschke 2018, Adv Methods Pract Psychol Sci) decides equivalence by whether the 95% HDI of the posterior for Δ lies inside [−δ, +δ]; Bayes factor for equivalence-vs-difference model is the model-comparison cousin
List the HONEST CAVEATS: the choice of δ is the most important step (bad δ = useless test); equivalence testing has its own multiple-comparisons issues; the procedure does NOT prove exact equality, only 'within the pre-specified margin'; underpowered equivalence tests fail to reject the equivalence-null even when truly equivalent
Connect to §2.4 (p-values) and §2.6 (preregistration): δ MUST be preregistered. Picking δ post-hoc to obtain the desired equivalence verdict is the equivalence-test analogue of p-hacking. Lakens (2017) calls this out explicitly
Preview §2.8 (the replication crisis): one major contributor to the crisis is the mass misuse of 'p > 0.05' as evidence of no effect, which equivalence testing was designed to replace

§2.1–§2.6 have built the NHST framework — the Neyman–Pearson decision rule, the t-test machinery, the p-value, multiple-testing corrections, and the preregistration discipline that locks the analysis path. All of that machinery answers ONE question: given the data, can we reject the null hypothesis that there is no effect? A REJECTION of H₀ supports the existence of an effect. But what does FAILURE to reject mean? In the strict Neyman–Pearson framework, the only honest answer is "we do not have enough evidence to reject H₀ at level α" — it does NOT mean the null is true. Yet the entire applied literature is full of statements that conflate the two: "no significant effect" reported as "no effect exists". That conflation is the wrong-direction inference; Altman and Bland (1995, BMJ) named it precisely: "absence of evidence is not evidence of absence".

This section is the fix. When the research question is genuinely "is the effect small enough to be ignorable?" — bioequivalence (does the generic drug perform like the brand?), non-inferiority (is the new treatment at least as good as the old, minus some clinical margin?), replication (was the original finding meaningfully large?) — the right machinery is EQUIVALENCE TESTING, not NHST. Equivalence testing inverts the null and the alternative: H₀ is "the effect is at least as large as some pre-specified margin δ" and H₁ is "the effect is smaller than δ". Rejecting H₀ now licenses the SUPPORTIVE conclusion "the effect is within the margin", which is exactly what bioequivalence regulators and non-inferiority trialists need.

The §2.7 arc has six stops. First, the ASYMMETRY of NHST and why "p > 0.05" cannot license "no effect". Second, the THREE practical question classes that demand equivalence machinery: bioequivalence, non-inferiority, replication. Third, the TOST (Two One-Sided Tests) PROCEDURE of Schuirmann (1987) — the canonical equivalence test for the two-sample mean problem. Fourth, the WESTLAKE–BERGER–HSU CI equivalence: TOST is geometrically the requirement that the 100(1 − 2α)% CI fits inside the equivalence margin. Fifth, two widgets that make all of this draggable: the tost-explorer for the single-dataset picture, and the equivalence-vs-superiority 2 × 2 that lays bare why these two tests answer ORTHOGONAL questions. Sixth, HONEST CAVEATS: the load-bearing role of δ, the post-hoc δ-hacking analogue, the Bayesian alternative, and the connection to the §2.8 replication crisis.

The asymmetry of NHST: why "p > 0.05" is not evidence of no effect

The Neyman–Pearson framework controls the Type-I error rate — the probability of rejecting H₀ when H₀ is true — at the pre-specified level α (§2.1). It does NOT, in general, control the Type-II error rate (the probability of failing to reject when H₁ is true) without further design — that is the role of POWER (§2.2). The asymmetry is built into the construction. A small p-value is evidence AGAINST H₀; a large p-value is silence — it can mean H₀ is true, or it can mean H₀ is false but the study is under-powered to detect the effect.

The clinical concrete case (Altman & Bland 1995, BMJ). A trial of 100 patients reports a treatment effect of d = 0.30 with p = 0.18 (n.s.). The investigators conclude "no significant effect of the treatment". A reader translates this to "the treatment does not work". But the 95% CI for d in that trial spans approximately [−0.10, +0.70] — entirely consistent with everything from a trivially-small benefit to a clinically-large benefit. The trial is uninformative about whether the treatment works, NOT a demonstration that it does not.

The literature is dense with this conflation. Hoekstra et al. (2006, Psychon Bull Rev) reviewed 40 papers reporting non-significant tests; 78% interpreted them as evidence of no effect. The misinterpretation is so common that "absence of evidence is not evidence of absence" is now a stock phrase in clinical-statistics textbooks (Senn 2007, ch. 12). The fix is procedural, not just rhetorical: if the research question is "is the effect small enough to ignore?", you need machinery DESIGNED to answer that question — equivalence testing.

Three settings that demand equivalence machinery

The three canonical use cases are well-defined and distinct.

Bioequivalence trials. A generic-drug manufacturer must show that the generic's pharmacokinetics (typically AUC and C_max for the active ingredient) are equivalent to the brand-name product's. The FDA (2003, Bioavailability and Bioequivalence Studies) specifies the equivalence margin on the LOG scale: the 90% CI for the log-ratio log(AUC_generic / AUC_brand) must lie entirely within (log 0.80, log 1.25). On the natural scale that is the 80%/125% margin, which is asymmetric about a ratio of 1 but symmetric about 0 on the log scale, because log(1/0.80) = log(1.25). The 90% CI corresponds to a per-side α of 0.05, i.e., TOST at α = 0.05 on each side.
Non-inferiority trials. A new treatment is compared to an established standard. The aim is not "the new treatment is better" (that would be a superiority trial), but "the new treatment is not meaningfully worse than the standard, given some pre-specified non-inferiority margin δ". Common in cardiology, oncology, and infectious-disease trials where the standard is effective and the new treatment offers a benefit on another axis (safety, convenience, cost). The CONSORT extension for non-inferiority (Piaggio et al. 2012, JAMA) standardises the reporting.
Replication studies and "smallest effect of interest". Original study reports d = 0.40, p = 0.04. A replication aims to demonstrate the original effect is reproducible. If the replication reports d = 0.05, p = 0.20, the original's claim is suspect — but a CI-only argument is insufficient. The cleaner inference is via equivalence to a SMALLEST EFFECT SIZE OF INTEREST (SESOI, Lakens 2017, Soc Psychol Pers Sci): pre-specify δ as the smallest effect that would be theoretically or practically meaningful, then run TOST on the replication. If TOST rejects, the replication has shown the effect is smaller than the SESOI — i.e., the original's claim was over-stated.

In all three cases the inference is "the effect is within a pre-specified margin". The NHST machinery cannot deliver this conclusion. TOST can.

The TOST procedure

Schuirmann (1987, J Pharmacokinet Biopharm) introduced TOST as the procedure that controls Type-I at α when the goal is to establish equivalence. The setup is the standard two-sample location problem: independent samples $X_1, \ldots, X_{n_X}$ from a population with mean $\mu_X$ and $Y_1, \ldots, Y_{n_Y}$ from a population with mean $\mu_Y$ , common variance $\sigma^2$ . The parameter of interest is $\Delta = \mu_X - \mu_Y$ . A symmetric equivalence margin $\delta > 0$ is PRE-SPECIFIED.

The composite hypotheses are inverted from the NHST setup:

Equivalence H₀ (composite): $|\Delta| \geq \delta$ . The effect is at LEAST as large as the margin.
Equivalence H₁ (composite): $|\Delta| < \delta$ . The effect is SMALLER than the margin.

Schuirmann's insight is that the composite H₀ decomposes into two one-sided pieces:

Lower test. $H_0^L: \Delta \leq -\delta$ vs $H_1^L: \Delta > -\delta$ . Reject at level α if $t_L = (\bar{X} - \bar{Y} + \delta) / se > t_{1-\alpha, df}$ .
Upper test. $H_0^U: \Delta \geq +\delta$ vs $H_1^U: \Delta < +\delta$ . Reject at level α if $t_U = (\bar{X} - \bar{Y} - \delta) / se < t_{\alpha, df} = -t_{1-\alpha, df}$ .

Reject the EQUIVALENCE H₀ — i.e., conclude equivalence — iff BOTH the lower and upper tests reject. The standard error and degrees of freedom are exactly those of the §2.3 pooled two-sample t-test: $se = s_p \sqrt{1/n_X + 1/n_Y}$ with $df = n_X + n_Y - 2$ . (Welch versions exist for unequal variances; the Schuirmann original assumes common σ.)
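A minimal sketch of the two one-sided tests from summary statistics, in Python. The function name tost_from_summary and the example numbers are illustrative assumptions. The p-values use the standard normal in place of the t distribution with $n_X + n_Y - 2$ degrees of freedom, which is a reasonable stand-in at moderate n but is not the exact Schuirmann test.

```python
from math import sqrt
from statistics import NormalDist


def tost_from_summary(mean_x, mean_y, sd_pooled, n_x, n_y, delta, alpha=0.05):
    """Two one-sided tests for |mu_X - mu_Y| < delta.

    ASSUMPTION: normal reference distribution instead of t with
    n_x + n_y - 2 degrees of freedom (slightly liberal at small n).
    """
    z = NormalDist()
    diff = mean_x - mean_y
    se = sd_pooled * sqrt(1 / n_x + 1 / n_y)
    stat_lower = (diff + delta) / se  # tests H0: Delta <= -delta
    stat_upper = (diff - delta) / se  # tests H0: Delta >= +delta
    p_lower = 1 - z.cdf(stat_lower)   # small when diff is well above -delta
    p_upper = z.cdf(stat_upper)       # small when diff is well below +delta
    return {
        "diff": diff,
        "se": se,
        "p_lower": p_lower,
        "p_upper": p_upper,
        # Equivalence is concluded only if BOTH one-sided tests reject.
        "equivalent": max(p_lower, p_upper) < alpha,
    }


# Hypothetical summary statistics, chosen only for illustration.
result = tost_from_summary(10.1, 10.0, 2.0, 200, 200, delta=1.0)
print(result)
```

Reporting the larger of the two one-sided p-values as "the" TOST p-value is a common convention, since the procedure rejects only when both are below α.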

Why two one-sided tests at α each give an overall Type-I rate of α (and not 2α). The composite null $|\Delta| \geq \delta$ is the UNION of two one-sided nulls. Rejecting the composite requires BOTH one-sided rejections, which is the INTERSECTION of the rejection events. Take any specific $\Delta_0$ with $|\Delta_0| \geq \delta$, say $\Delta_0 \geq +\delta$. Then the upper null $H_0^U$ is true, so the upper test rejects with probability at most α; and because the joint rejection event is contained in the upper test's rejection event, the probability that BOTH tests reject is also at most α. The case $\Delta_0 \leq -\delta$ is the mirror image, with the lower test playing the same role. This is the classical INTERSECTION-UNION TEST (IUT) argument, formalised by Berger and Hsu (1996, Stat Sci): an IUT formed from level-α tests of each piece of a composite null has overall level α. That is the structural reason TOST does not need a Bonferroni-type correction.
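The level-α claim can be checked by simulation. The sketch below is an illustration under stated assumptions, not part of Schuirmann's or Berger and Hsu's argument: σ is treated as known, so the observed mean difference is drawn directly from a normal distribution and both tests are z-tests. The function name and the settings (σ = 2, n = 400 per group, δ = 1) are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def tost_rejection_rate(true_delta, delta, sigma, n, alpha=0.05,
                        reps=20000, seed=1):
    """Fraction of simulated trials in which BOTH one-sided z-tests reject.

    ASSUMPTION: sigma is known, so the mean difference of two groups of
    size n is drawn directly as Normal(true_delta, se).
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha)
    se = sigma * sqrt(2 / n)
    hits = 0
    for _ in range(reps):
        diff = rng.gauss(true_delta, se)
        lower_rejects = (diff + delta) / se > crit
        upper_rejects = (diff - delta) / se < -crit
        hits += lower_rejects and upper_rejects
    return hits / reps


# True difference exactly on the boundary of the equivalence null.
rate_boundary = tost_rejection_rate(true_delta=1.0, delta=1.0, sigma=2.0, n=400)
# True difference strictly inside the equivalence null.
rate_beyond = tost_rejection_rate(true_delta=1.3, delta=1.0, sigma=2.0, n=400)
print(rate_boundary, rate_beyond)
```

On the boundary the false-equivalence rate sits near α rather than 2α, and it falls further as the true difference moves deeper into the null.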

The Westlake–Berger–Hsu confidence-interval characterisation

The cleanest way to think about TOST is geometrically. Westlake (1976, Biometrics) and Berger & Hsu (1996, Stat Sci) proved the equivalence-of-procedures result: TOST at level α rejects H₀ if and only if the $100(1-2\alpha)\%$ confidence interval for $\Delta$ lies entirely inside the equivalence margin $(-\delta, +\delta)$.

The number $1 - 2\alpha$ is not a typo. At per-side α = 0.05, the CI you check is the 90% CI (not 95%). This is the standard convention in bioequivalence: regulators check the 90% CI for the log-ratio against the (log 0.80, log 1.25) margin. The 90% comes from TOST at α = 0.05 per side. A 95% CI is more conservative than needed and a 99% CI is much more conservative still; the canonical 90% CI is the one that maps exactly onto TOST at α = 0.05.

The proof sketch (Berger & Hsu 1996, Theorem 2). The lower TOST rejects iff $t_L = (\bar{X} - \bar{Y} + \delta) / se > t_{1-\alpha, df}$, equivalently $\bar{X} - \bar{Y} - t_{1-\alpha, df} \cdot se > -\delta$, i.e., the lower endpoint of the $100(1-2\alpha)\%$ CI exceeds $-\delta$. Symmetrically, the upper TOST rejects iff the upper endpoint of the same CI is below $+\delta$. Both rejections hold together iff the entire CI sits inside $(-\delta, +\delta)$. The visual picture this gives is the heart of the first widget.
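The equivalence of the two decision rules can also be confirmed numerically. This is a small sketch with hypothetical function names; it uses normal critical values as a stand-in for $t_{1-\alpha, df}$, which does not affect the algebra because the same critical value appears in both rules.

```python
from statistics import NormalDist


def ci_inside_margin(diff, se, delta, alpha=0.05):
    """True iff the 100(1 - 2*alpha)% CI lies strictly inside (-delta, +delta)."""
    crit = NormalDist().inv_cdf(1 - alpha)  # ASSUMPTION: z in place of t
    return -delta < diff - crit * se and diff + crit * se < delta


def tost_rejects(diff, se, delta, alpha=0.05):
    """True iff both one-sided tests reject at level alpha."""
    crit = NormalDist().inv_cdf(1 - alpha)
    lower_rejects = (diff + delta) / se > crit
    upper_rejects = (diff - delta) / se < -crit
    return lower_rejects and upper_rejects


# Sweep observed differences from -2 to +2 at a fixed se and margin.
grid = [i / 100 for i in range(-200, 201)]
rules_agree = all(
    ci_inside_margin(d, 0.3, 1.0) == tost_rejects(d, 0.3, 1.0) for d in grid
)
print(rules_agree)
```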

Watching TOST run on a dataset — the tost-explorer

The first widget makes the Westlake–Berger–Hsu picture interactive. Pick the pre-specified equivalence margin δ, the true difference Δ that governs the simulation, per-group n, within-group σ, and per-side α. The widget simulates ONE dataset under those settings, runs the two one-sided t-tests, and draws the 100(1 − 2α)% CI against the equivalence-margin band. The verdict is colour-coded — GREEN when the CI fits inside, RED when it lies entirely outside, YELLOW when it straddles the boundary.

Things to verify in the widget:

Start at the defaults (δ = 1.0, Δ = 0, n = 30, σ = 2, α = 0.05). The 90% CI for the mean difference has a half-width of roughly 0.86 (se ≈ 0.52), almost as wide as the margin itself, so unless the observed difference lands very close to 0 one end of the CI pokes past ±1.0 and the verdict tends to be YELLOW: equivalence INCONCLUSIVE. This is the underpowered-but-not-clearly-different state. Now slide n up to 200. The CI shrinks to a half-width of about 0.33; it fits inside (−1, +1) in nearly every draw and the verdict turns GREEN. That transition is the equivalence-trial sample-size logic: the trial must be powered to deliver a CI comfortably narrower than the margin band.
Set δ = 0.5 and n = 30. With true Δ = 0 the CI is the same width as before, but the margin is half as wide, so the verdict stays YELLOW even at n = 60, where the CI half-width (about 0.60) still exceeds δ. A smaller δ needs a larger n. Keep Δ = 0 and crank n to 300. The CI half-width falls to about 0.27, the CI fits inside (−0.5, +0.5) in most draws, and the verdict turns GREEN. This is the bioequivalence design problem in miniature: a tight margin DEMANDS a large sample.
Set Δ = 1.5, δ = 1.0, n = 30. The CI sits centred near 1.5. At the default σ = 2 its half-width is still about 0.86, so it usually straddles +δ and the verdict is YELLOW; lower σ to 0.5 and the CI lies entirely outside the margin band. Verdict RED: the data argue the effect is LARGER than δ, so equivalence is not just unsupported but contradicted. This is the failure mode the regulator catches.
Re-roll the dataset a few times at fixed (Δ, δ, n, σ, α). The verdict can flicker between YELLOW and GREEN — the CI is a random interval. The TOST p-values flicker correspondingly. The Type-I and Type-II rates over many such draws are exactly the quantities Schuirmann's 1987 power formula governs.
Click the α radios. At α = 0.025 (the per-side 95% CI version) the CI widens to 95% — harder to fit inside the margin. At α = 0.10 the CI is the 80% one — easier to fit inside. Higher per-side α makes the test more liberal in the equivalence direction. The standard regulatory convention is α = 0.05 (so 90% CI), but software defaults vary.
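What the tost-explorer computes on a single dataset can be sketched in a few lines. This is not the widget's actual code: the function name, the colour rule as written here, and the seed are assumptions based on the description above, and the CI uses a normal critical value where the widget presumably uses t.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def tost_verdict(x, y, delta, alpha=0.05):
    """Return (ci_low, ci_high, colour) for the difference in means of x and y."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * variance(x) + (n_y - 1) * variance(y)) / (n_x + n_y - 2)
    se = sqrt(pooled_var * (1 / n_x + 1 / n_y))
    crit = NormalDist().inv_cdf(1 - alpha)  # ASSUMPTION: z in place of t
    diff = mean(x) - mean(y)
    low, high = diff - crit * se, diff + crit * se
    if -delta < low and high < delta:
        colour = "GREEN"   # CI fits inside the margin: equivalence shown
    elif low >= delta or high <= -delta:
        colour = "RED"     # CI entirely outside the margin
    else:
        colour = "YELLOW"  # CI straddles a margin boundary: inconclusive
    return low, high, colour


# One simulated dataset at the widget defaults (delta = 1, Delta = 0, n = 30, sigma = 2).
rng = random.Random(7)
x = [rng.gauss(0.0, 2.0) for _ in range(30)]
y = [rng.gauss(0.0, 2.0) for _ in range(30)]
print(tost_verdict(x, y, delta=1.0))
```

Changing the seed plays the role of the widget's re-roll button: the interval moves from draw to draw, and the colour can change with it.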

Equivalence and superiority are orthogonal questions

The single most common misunderstanding of equivalence testing is to treat it as the opposite of superiority testing. It is not. They answer DIFFERENT questions about the same parameter Δ, and either answer can be yes or no independently of the other. The 2 × 2 table:

	Superiority: reject H₀	Superiority: do not reject
Equivalence: GREEN	Small but real effect	Truly no difference
Equivalence: NOT	Large clear effect	Underpowered — unclear

The four interpretive cells are mutually exclusive and jointly exhaustive. Each corresponds to a different sort of result a researcher might report:

Superiority YES + equivalence YES — "small but real". Both tests reject. The effect is detectably non-zero AND within the equivalence margin. The bioequivalence case where the generic differs slightly from the brand but the difference is below clinical relevance.
Superiority NO + equivalence YES — "truly no difference". Cannot distinguish from zero AND the CI fits inside the margin. The cleanest possible verdict in the SESOI-replication setting.
Superiority YES + equivalence NO — "large, clear effect". The standard rejection of H₀ that NHST is designed for. Equivalence is unsupported.
Superiority NO + equivalence NO — "underpowered". Neither test rejects. This is the dangerous corner: the literature's "no significant difference" papers often live here and mis-interpret themselves as belonging to the "truly no difference" cell. The CI is too wide to land in either zone.
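The four cells can be expressed as a small classifier. The sketch below is illustrative: it assumes the superiority test is a two-sided z-test at level α and TOST uses per-side α, both with normal critical values, and it borrows the two-letter codes EE, EN, NE, NN (equivalence verdict first, superiority verdict second) used for the widget presets.

```python
from statistics import NormalDist


def quadrant(diff, se, delta, alpha=0.05):
    """Classify one result into the equivalence-by-superiority 2 x 2 table."""
    z = NormalDist()
    crit_one_sided = z.inv_cdf(1 - alpha)      # per-side TOST critical value
    crit_two_sided = z.inv_cdf(1 - alpha / 2)  # ASSUMPTION: two-sided superiority test
    equivalent = ((diff + delta) / se > crit_one_sided
                  and (diff - delta) / se < -crit_one_sided)
    different = abs(diff) / se > crit_two_sided
    if equivalent and different:
        return "EE"  # small but real
    if equivalent:
        return "EN"  # truly no difference
    if different:
        return "NE"  # large clear effect
    return "NN"      # underpowered, unclear


# Hypothetical (diff, se, delta) triples, one per cell.
for args in [(0.3, 0.1, 1.0), (0.0, 0.2, 1.0), (1.5, 0.3, 0.5), (0.2, 0.8, 0.5)]:
    print(args, quadrant(*args))
```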

The second widget runs both procedures on the same simulated dataset and lights up the active quadrant. The Monte Carlo replicates panel underneath samples K independent datasets at the same (Δ, n, σ, δ, α) and tallies the empirical frequency of each quadrant, so the reader can see, for example, that a Δ = 0, n = 12 design lands in the UNDERPOWERED cell ≈ 90% of the time, while the same Δ = 0 at n = 100 with a generous δ lands in TRULY NO DIFFERENCE in the large majority of replicates (~ 80%+, as in the "Truly equivalent" preset below).

Things to verify in the widget:

Click the "Underpowered (Δ ≈ 0, small n)" preset. Δ = 0, σ = 2, n = 12, δ = 0.5, α = 0.05. The active quadrant is almost always UNDERPOWERED. The Monte Carlo table at the bottom shows ~ 90%+ of replicates landing in the NN cell. Even though the truth is Δ = 0 (i.e., the populations are exactly equivalent), the design cannot deliver the equivalence verdict because the CI is wider than the margin.
Click "Truly equivalent (Δ = 0, big n)". Same Δ = 0, but now n = 100, δ = 1.0. The CI is much narrower and the margin is wider, so the GREEN equivalence verdict fires in the majority of replicates. The empirical frequency of the EN cell (equivalence YES + superiority NO) climbs to ~ 80%+ — the canonical "truly no difference" detection.
Click "Small but real". Δ = 0.3, σ = 1, n = 100, δ = 1.0. Now BOTH tests tend to fire: the true effect is detectably non-zero (superiority rejects) AND inside the margin (TOST rejects). The active quadrant is EE — "small but real". The Monte Carlo replicates concentrate in EE.
Click "Large clear effect". Δ = 1.5, n = 40, δ = 0.5. The effect is much bigger than the margin: superiority always rejects, TOST never does. The replicates pile into the NE corner — large clear effect.
Click "Bioequivalence-style". Δ = 0.05, σ = 0.5, n = 24, δ = 0.22 (a numerical analogue of the FDA log(0.8, 1.25) margin on a small scale). The result depends on the seed — typically a mix of EN and NN. This illustrates the genuine difficulty of bioequivalence design: a tight margin demands a careful n, and even at the chosen n the verdict is probabilistic, not deterministic.
Re-roll datasets a few times. Watch the active quadrant shift. The Monte Carlo proportions are more stable than the single-dataset verdict, exactly because they average over the sampling noise — the same reason a designed bioequivalence trial reports the CI, not just the point estimate.
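The Monte Carlo panel can be imitated with a short simulation. As before this is a sketch, not the widget's implementation: σ is treated as known (so the mean difference is drawn directly), the superiority test is assumed two-sided at level α, and σ = 2 is assumed for the "Truly equivalent" preset since the text only says Δ stays the same.

```python
import random
from collections import Counter
from math import sqrt
from statistics import NormalDist


def quadrant_frequencies(true_delta, sigma, n, delta, alpha=0.05,
                         reps=5000, seed=0):
    """Empirical frequency of the EE / EN / NE / NN cells over simulated trials."""
    rng = random.Random(seed)
    z = NormalDist()
    crit_one = z.inv_cdf(1 - alpha)
    crit_two = z.inv_cdf(1 - alpha / 2)
    se = sigma * sqrt(2 / n)  # ASSUMPTION: sigma known, n per group
    counts = Counter()
    for _ in range(reps):
        diff = rng.gauss(true_delta, se)
        equivalent = (diff + delta) / se > crit_one and (diff - delta) / se < -crit_one
        different = abs(diff) / se > crit_two
        # First letter: equivalence verdict; second letter: superiority verdict.
        counts[("E" if equivalent else "N") + ("E" if different else "N")] += 1
    return {cell: counts[cell] / reps for cell in ("EE", "EN", "NE", "NN")}


underpowered = quadrant_frequencies(true_delta=0.0, sigma=2.0, n=12, delta=0.5)
well_powered = quadrant_frequencies(true_delta=0.0, sigma=2.0, n=100, delta=1.0)
print(underpowered)
print(well_powered)
```

In the underpowered design the CI half-width exceeds δ, so under these assumptions no replicate can land in an equivalence cell at all; the design, not the data, rules out the verdict.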

The FDA bioequivalence margin — the worked δ

The most concrete example of an equivalence margin in industrial use is the FDA bioequivalence margin (FDA 2003, Bioavailability and Bioequivalence Studies for Orally Administered Drug Products). Two pharmacokinetic parameters are typically tested: the area under the concentration-time curve (AUC) and the maximum concentration (C_max). The test is on the log-ratio: $\theta = \log(\mathrm{param}_{\text{generic}} / \mathrm{param}_{\text{brand}})$, where param is AUC or C_max.

The 90% CI for $\theta$ must lie within $[\log(0.80), \log(1.25)] \approx [-0.223, +0.223]$ . The asymmetric-on-the-natural-scale boundaries (80%/125%) are SYMMETRIC on the log scale because $\log(1/0.80) = \log(1.25) \approx 0.223$ . This symmetry is exactly why the log scale is the working scale in bioequivalence: the equivalence margin is symmetric, the TOST machinery applies directly, and the multiplicative interpretation ("the generic differs from the brand by no more than 20%/25%") matches the pharmacological intuition.
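The margin arithmetic is easy to verify directly. In this sketch the function name within_fda_margin and the two example confidence intervals are hypothetical; the only facts taken from the text are the 0.80 and 1.25 limits and the rule that the 90% CI for the log-ratio must lie within them.

```python
from math import log

LOWER = log(0.80)  # lower limit on the log scale
UPPER = log(1.25)  # upper limit on the log scale
print(round(LOWER, 4), round(UPPER, 4))


def within_fda_margin(ratio_ci_low, ratio_ci_high):
    """Check a 90% CI, given on the RATIO scale, against the log-scale margin."""
    return LOWER <= log(ratio_ci_low) and log(ratio_ci_high) <= UPPER


# Hypothetical 90% CIs for AUC_generic / AUC_brand.
print(within_fda_margin(0.92, 1.08))  # both ends inside 0.80 to 1.25
print(within_fda_margin(0.78, 1.02))  # lower end below 0.80
```

Because the logarithm is monotone, checking the log-ratio CI against ±log(1.25) and checking the ratio CI against (0.80, 1.25) give the same verdict; the log scale matters for computing the CI, not for comparing it.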

The 90% CI corresponds to per-side α = 0.05 — Schuirmann's original convention and still the FDA standard. The design implication: the trial must be large enough that the 90% CI for the log-ratio fits inside ±0.223 with adequate power, given the residual standard deviation σ of the log-ratio. For most small-molecule oral generics, this is achievable with 24–36 healthy volunteers in a crossover design (FDA 2003, Appendix A; Senn 2007, ch. 11). Specialised molecules (highly variable drugs, narrow therapeutic index) require larger n or modified procedures.

Non-inferiority: a one-sided equivalence

Non-inferiority is the asymmetric cousin of equivalence. The setup: a new treatment is compared to an established control. The aim is to show the new treatment is "not meaningfully worse" — i.e., $\Delta = \mu_{\text{new}} - \mu_{\text{control}} > -\delta$ for some pre-specified non-inferiority margin $\delta > 0$ .

This is one half of the TOST procedure — only the LOWER one-sided test. The upper side (does the new treatment differ from the control in the OTHER direction, i.e., is it BETTER?) is deliberately not part of the non-inferiority claim. If a non-inferiority trial also wishes to establish superiority, this is done as a hierarchical testing strategy: first establish non-inferiority (reject $H_0^L$ ), then check whether the lower bound of the CI exceeds 0 (reject the conventional $H_0$ ). The CONSORT extension for non-inferiority (Piaggio et al. 2012, JAMA) standardises the reporting of these hierarchical claims.

The margin δ in non-inferiority trials is typically set as a fraction (commonly 50%) of the historical superiority effect of the standard treatment over placebo, with both regulatory and clinical input. The FDA 2016 Non-Inferiority Clinical Trials guidance and EMA 2005 equivalent describe the framework formally; Wellek (2010, ch. 6) gives the textbook treatment.
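The hierarchical non-inferiority-then-superiority check reduces to two comparisons on the lower confidence bound. The sketch below assumes a normal approximation and one-sided α = 0.025 (so the bound is the lower limit of the usual two-sided 95% CI); the function name and the example numbers are hypothetical.

```python
from statistics import NormalDist


def non_inferiority(diff, se, margin, alpha=0.025):
    """One-sided check that diff = new - control exceeds -margin.

    Returns the lower confidence bound and two flags. Superiority is only
    claimed after non-inferiority, mirroring the hierarchical strategy.
    """
    crit = NormalDist().inv_cdf(1 - alpha)  # ASSUMPTION: normal approximation
    lower_bound = diff - crit * se
    non_inferior = lower_bound > -margin
    superior = non_inferior and lower_bound > 0
    return lower_bound, non_inferior, superior


# Hypothetical trial summaries (difference and se in the same units as the margin).
print(non_inferiority(diff=-1.0, se=2.0, margin=5.0))
print(non_inferiority(diff=-1.0, se=2.5, margin=5.0))
print(non_inferiority(diff=6.0, se=2.0, margin=5.0))
```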

Power calculations for TOST

The power of TOST is the probability that BOTH one-sided tests reject under a specific true Δ. Schuirmann (1987, §3) gave the exact expression in terms of the non-central t-distribution. Treating se as known and replacing t by the standard normal gives the working approximation

$$\mathrm{Power}(\Delta) \approx \max\left\{0,\; \Phi\!\left(\frac{\delta - \Delta}{se} - z_{1-\alpha}\right) - \Phi\!\left(-\frac{\delta + \Delta}{se} + z_{1-\alpha}\right)\right\}$$

which follows because TOST rejects exactly when $-\delta + z_{1-\alpha}\, se < \bar{X} - \bar{Y} < \delta - z_{1-\alpha}\, se$. At Δ = 0 (perfect equivalence), this reduces to $\Phi(\delta/se - z_{1-\alpha}) - \Phi(-\delta/se + z_{1-\alpha})$. At Δ on the boundary $|\Delta| = \delta$, the power is at most α (the Type-I rate) and approaches α as n grows. At Δ outside the margin, the power falls below α and decays towards zero as |Δ| grows. The visual: power is HIGHEST at the centre Δ = 0, drops smoothly to α at the boundaries, and decays beyond. Larger n drives the curve up inside the margin; larger δ also drives it up (the margin is more permissive).
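The shape of the power curve can be traced with the normal approximation. This sketch treats se as known and uses Φ in place of the non-central t, so it is an approximation to Schuirmann's exact power, not a reimplementation of it; the function name and the example settings are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def tost_power(true_delta, delta, se, alpha=0.05):
    """Approximate P(both one-sided tests reject) when the true difference is true_delta.

    ASSUMPTION: se known, normal reference distribution. TOST rejects when
    -delta + z*se < observed difference < delta - z*se.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha)
    upper = z.cdf((delta - true_delta) / se - crit)
    lower = z.cdf(-(delta + true_delta) / se + crit)
    return max(0.0, upper - lower)


se_small_n = 2.0 * sqrt(2 / 30)   # sigma = 2, n = 30 per group
se_large_n = 2.0 * sqrt(2 / 200)  # sigma = 2, n = 200 per group
for true_delta in (0.0, 0.5, 1.0, 1.5):
    print(true_delta,
          round(tost_power(true_delta, 1.0, se_small_n), 3),
          round(tost_power(true_delta, 1.0, se_large_n), 3))
```

Read across a row to see the effect of n and down a column to see the power fall as the true difference moves from the centre to the boundary and beyond.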

For design, the working rule is: pick δ from subject-matter knowledge, pick a target power (usually 0.80 or 0.90) at Δ = 0, and solve for n given σ. The range of software for this is broad (R packages TOSTER, PowerTOST; Stata's tostmean; G*Power 3.1 with the appropriate option). Lakens (2017) walks the entire workflow with worked examples.
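For a first-pass n, the Δ = 0 normal approximation can be inverted by hand. Setting $2\Phi(\delta/se - z_{1-\alpha}) - 1$ equal to the target power $1 - \beta$ with $se = \sigma\sqrt{2/n}$ gives $n \approx 2(z_{1-\alpha} + z_{1-\beta/2})^2 \sigma^2 / \delta^2$ per group. The sketch below implements that formula; it is a planning approximation under known σ and equal group sizes, not a substitute for the exact calculations in the packages named above, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group n for TOST power at true difference 0.

    ASSUMPTION: normal approximation, known sigma, two parallel groups.
    The Type-II error is split across both margins, hence z at 1 - beta/2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(1 - (1 - power) / 2)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Widget-style settings: sigma = 2 with margins of 1.0 and 0.5.
print(n_per_group(delta=1.0, sigma=2.0))
print(n_per_group(delta=0.5, sigma=2.0))
```

Halving δ quadruples the required n, which is the arithmetic behind "a tight margin demands a large sample".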

Bayesian alternatives: ROPE and Bayes factors

The Bayesian community has its own equivalence machinery. Two common procedures.

ROPE-vs-posterior (Kruschke 2018, Adv Methods Pract Psychol Sci). The Region of Practical Equivalence (ROPE) is the same idea as the equivalence margin — a pre-specified interval $[-\delta, +\delta]$ within which the effect is "practically equivalent to zero". Run a full Bayesian analysis to obtain the posterior for Δ. The Bayesian equivalence decision: reject the "non-equivalence" model if the 95% highest-density interval (HDI) of the posterior lies entirely inside the ROPE. This is the direct Bayesian analogue of the Westlake–Berger–Hsu CI test, with the Bayesian credible interval replacing the frequentist confidence interval.
Bayes factor for equivalence vs difference model. Define two models: M_equiv (Δ ~ uniform on $[-\delta, +\delta]$ ) and M_diff (Δ ~ wider prior, e.g., Cauchy(0, scale)). Compute the Bayes factor BF_{equiv,diff}. Large BF_{equiv,diff} = evidence for equivalence. This is the model-comparison framing favoured by some Bayesians (Wagenmakers et al. 2018, Psychon Bull Rev).

Both Bayesian routes share an important feature: they REQUIRE the same up-front commitment to δ that TOST does. The choice of δ is the load-bearing decision regardless of whether the inference is frequentist or Bayesian. Bayesian methods do not save you from having to pre-specify what counts as equivalence.
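The HDI-in-ROPE rule can be sketched in the simplest possible setting. The assumptions here are strong and are not Kruschke's general procedure: a flat prior on Δ and a normal likelihood with known se, so that the posterior is Normal(observed difference, se) and the 95% HDI coincides with the central 95% interval. The function name and the example inputs are hypothetical.

```python
from statistics import NormalDist


def rope_decision(diff, se, delta, mass=0.95):
    """HDI-versus-ROPE decision under a flat prior and known se.

    ASSUMPTION: posterior for Delta is Normal(diff, se), which is symmetric
    and unimodal, so the HDI equals the central credible interval.
    """
    half_width = NormalDist().inv_cdf(1 - (1 - mass) / 2) * se
    hdi_low, hdi_high = diff - half_width, diff + half_width
    if -delta < hdi_low and hdi_high < delta:
        return "equivalent"  # HDI entirely inside the ROPE
    if hdi_low > delta or hdi_high < -delta:
        return "different"   # HDI entirely outside the ROPE
    return "undecided"       # HDI overlaps a ROPE boundary


print(rope_decision(0.1, 0.2, 1.0))
print(rope_decision(1.5, 0.2, 1.0))
print(rope_decision(0.8, 0.2, 1.0))
```

Under these particular assumptions the 95% HDI is numerically the same interval as the 95% confidence interval, so the ROPE rule matches TOST at per-side α = 0.025; with an informative prior the two would diverge.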

Honest caveats

Equivalence testing is not a silver bullet. Four honest caveats.

The choice of δ is the most important step. A too-large δ makes the test trivial (anything looks equivalent); a too-small δ makes it impossible to pass (no realistic n delivers a CI that narrow). δ MUST be justified by subject-matter knowledge — typically clinical (the smallest difference that would affect treatment decisions), regulatory (the FDA 0.80/1.25 margin), or theoretical (the smallest effect predicted by the theory under test). Lakens (2017) is the practical primer; Wellek (2010, ch. 1) is the formal one.
Equivalence testing has its own multiple-comparisons problem. If you test multiple endpoints for equivalence in the same trial, the §2.5 family-wise control still applies. Bioequivalence trials handle this by pre-specifying ONE primary endpoint (usually log-AUC) and treating others as secondary with no formal correction. Reading Senn (2007, ch. 12) is useful here.
"Within the margin" is not the same as "equal". A successful TOST does NOT prove the two treatments are identical — only that any difference is bounded by δ. The size of δ matters: a TOST that passes with δ = 0.50 σ is a much weaker claim than the same TOST passing with δ = 0.10 σ. Reporting should always include both the verdict AND the CI for Δ (so readers can see the effective margin).
δ-hacking is the equivalence-test analogue of p-hacking. The same way picking outcomes post-hoc to find a significant p inflates false positives, picking δ post-hoc to find a successful TOST inflates false equivalence claims. The fix is the same as in §2.6: PREREGISTER δ. Lakens (2017) is explicit about this; the preregistration template in §2.6 includes "smallest effect of interest" as a separate field for exactly this reason.
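The cost of δ-hacking can be made concrete with a simulation. The scenario is hypothetical: the preregistered margin is δ = 1.0 and the true difference is exactly 1.0, so the truth lies in the non-equivalence null and any equivalence claim is a false positive. The honest analyst tests only δ = 1.0; the δ-hacker tries 1.0, 1.25 and 1.5 and reports whichever passes. σ is treated as known and the tests are z-tests.

```python
import random
from math import sqrt
from statistics import NormalDist


def equivalence_claim_rate(true_delta, candidate_margins, sigma, n,
                           alpha=0.05, reps=20000, seed=3):
    """Fraction of simulated trials in which TOST passes for ANY candidate margin.

    ASSUMPTION: sigma known, n per group, normal critical values.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha)
    se = sigma * sqrt(2 / n)
    claims = 0
    for _ in range(reps):
        diff = rng.gauss(true_delta, se)
        # TOST passes for margin m iff the CI diff +/- crit*se lies inside (-m, +m).
        if any(abs(diff) + crit * se < m for m in candidate_margins):
            claims += 1
    return claims / reps


honest = equivalence_claim_rate(1.0, [1.0], sigma=2.0, n=400)
hacked = equivalence_claim_rate(1.0, [1.0, 1.25, 1.5], sigma=2.0, n=400)
print(honest, hacked)
```

Measured against the preregistered margin, the honest rate stays near α while the post-hoc rate is many times larger; the data did not change, only the definition of "equivalent".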

Try it

In the tost-explorer, set δ = 1.0, Δ = 0, n = 30, σ = 2.0, α = 0.05. Note the verdict (typically YELLOW). Now drag n up to 200 and watch the CI shrink until the verdict turns GREEN. Read off the smallest n at which GREEN first holds. That is the EFFECTIVE sample size for an equivalence trial on this (δ, σ) design — the operational answer to "how many subjects do we need?".
Same widget. Set δ = 0.5, Δ = 0, σ = 2.0, n = 60. Read off the verdict. Now keep δ = 0.5 but reduce σ to 0.5 (much tighter inter-subject homogeneity). The CI shrinks 4× and the verdict turns GREEN. Argue from this: bioequivalence trials use within-subject crossover designs because the residual σ in a crossover is much smaller than between-subject, reducing the required n.
Same widget. Set Δ = 0.5, δ = 1.0, n = 40, σ = 1.0. Verdict typically GREEN (the 90% CI sits around [0.13, 0.87], inside the margin). Now switch α from 0.05 to 0.025. The CI widens (now 95%, roughly [0.05, 0.95]), and the verdict may flip to YELLOW. Argue why per-side α controls how conservative the equivalence test is: smaller α = wider CI = harder to fit in the margin.
In the equivalence-vs-superiority widget, click "Underpowered (Δ ≈ 0, small n)". Note the dominant quadrant (NN, ~ 90% of replicates). Now click "Truly equivalent (Δ = 0, big n)". The dominant quadrant shifts to EN. Argue from the two cells: BOTH have Δ = 0, but only the well-powered design supports the equivalence claim. "No significant difference" in the underpowered design is uninformative.
Same widget. Click "Small but real". Read off the EE frequency. Argue why the EE cell is exactly the bioequivalence case: a small but real effect that is still small enough to fall within the regulatory equivalence margin. The generic IS slightly different from the brand, but the difference is below clinical relevance.
Pen-and-paper. FDA bioequivalence sets the equivalence margin at (log 0.80, log 1.25) for the log-ratio of AUC. (a) Show this margin is symmetric on the log scale (i.e., log(1/0.80) = log(1.25)). (b) Compute the half-width on the log scale: log(1.25) ≈ 0.2231. (c) Suppose σ on the log scale is 0.15 (a typical between-subject CV of ~ 15%) and the trial is a crossover with σ_within = 0.075. For a per-side α = 0.05 trial with n subjects each measured in both arms, derive the approximate n needed to deliver power = 0.80 at Δ = 0. (Use the normal approximation implied by the Δ = 0 power expression: n ≈ 2 (z_{1-α} + z_{1-β/2})²·σ_within² / δ². The β/2 appears because at Δ = 0 the Type-II error is split between the two margins.)
Pen-and-paper. Walker & Nowacki (2011, J Gen Intern Med) report a non-inferiority trial of a new antibiotic with NI margin δ = 10 percentage points on the cure rate, n = 200 per arm, observed difference = −3 percentage points, 95% CI for the difference = [−10.5, +4.5]. Argue from the CI: does this trial demonstrate non-inferiority at the per-side α = 0.025 level? (Hint: the lower limit of the two-sided 95% CI is the one-sided 97.5% lower confidence bound, so comparing it with −δ is the one-sided test at α = 0.025; the lower bound −10.5 is just outside −10.) What would the CI need to look like to declare non-inferiority?
Pen-and-paper. A replication study aims to test whether an original finding of d = 0.40 (p = 0.04, n = 50) reproduces. The replicators pre-register a SESOI of d = 0.20 (the smallest effect that would be theoretically meaningful) and plan TOST at per-side α = 0.05. The replication reports d = 0.05, p ≈ 0.67, with n = 150 per group. Compute the TOST p-values (approximate, using normal-approximation se ≈ √(2/n)) for the lower and upper one-sided tests. Conclude: does the replication support equivalence to zero within the SESOI, or is it inconclusive?
Pen-and-paper. Distinguish carefully: (a) "the trial failed to reject H₀" vs (b) "the trial demonstrated equivalence" vs (c) "the trial demonstrated non-inferiority". For each pair, name one literature claim that conflates them and the correct procedural fix. Cite Altman & Bland (1995) for (a) ↔ (b) and Piaggio et al. (2012) for (b) ↔ (c).
Pen-and-paper. Berger & Hsu (1996, Stat Sci) prove the intersection-union test (IUT) result: an IUT formed from level-α tests of each piece of a composite null has overall level α — no multiplicity correction needed. Sketch the proof for the two-piece TOST case: under any specific $\Delta_0$ with $|\Delta_0| \geq \delta$ , what is the maximum Type-I rate of the IUT? Why does it equal α, not 2α?
Pen-and-paper. Schuirmann (1987) compared TOST to the "power approach" — declare equivalence if a t-test fails to reject H₀ AND the trial had nominal power ≥ 0.80 against |Δ| = δ. Argue why the power approach is NOT a valid α-level equivalence test in general (hint: it conditions on a separate event, the power calculation, which depends on σ — but σ is estimated from the same data, so the Type-I rate is uncontrolled). Conclude why the IUT-based TOST is the regulatorily-recognised procedure.
Pen-and-paper. Lakens (2017) recommends preregistering δ as part of the §2.6 preregistration. Argue why post-hoc δ choice is the equivalence-test analogue of p-hacking: list two specific failure modes (e.g., picking δ = 0.5 if observed |Δ| = 0.3, vs picking δ = 0.2 if observed |Δ| = 0.1) and explain how each inflates the false-equivalence rate. Use the §2.4 garden-of-forking-paths language.

Pause and reflect: §2.7 has built the formal machinery for SUPPORTING the null. The TOST procedure (Schuirmann 1987) is the intersection of two one-sided tests, each at level α, and is equivalent to checking that the 100(1 − 2α)% CI fits inside the pre-specified margin (−δ, +δ) (Westlake 1976; Berger & Hsu 1996). The two widgets give the visceral picture: the tost-explorer shows the CI sliding in and out of the margin band as you drag δ, Δ, n, σ, α; the equivalence-vs-superiority 2 × 2 shows the four mutually exclusive outcomes, with the UNDERPOWERED corner as the literature's most common misinterpretation. Three settings demand this machinery: bioequivalence (FDA 0.80/1.25 margin on log-AUC), non-inferiority (a single one-sided test against −δ, dropping the +δ side), and replication (SESOI). The load-bearing decision is δ, which MUST be preregistered (Lakens 2017). What equivalence testing does NOT do: prove exact equality, fix bad measurement, fix small n, or avoid its own multiple-comparisons issues. §2.8 will integrate §2.4–§2.7 into the replication crisis: what happens to a literature when these procedural disciplines fail systematically across a generation.
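The TOST-equals-CI-inclusion statement can be checked numerically. The sketch below uses the FDA margin δ = log(1.25) on the log scale and a normal approximation with a known standard error; the standard error of 0.08 and the grid of estimates are hypothetical illustration values, and the helper names are not from any library.

```python
from math import log
from statistics import NormalDist


def tost_rejects(est, se, delta, alpha=0.05):
    """TOST decision: both one-sided z-tests must reject at level alpha."""
    z = NormalDist()
    p_lower = 1.0 - z.cdf((est + delta) / se)   # H0: diff <= -delta
    p_upper = z.cdf((est - delta) / se)         # H0: diff >= +delta
    return max(p_lower, p_upper) < alpha


def ci_inside(est, se, delta, alpha=0.05):
    """Is the 100(1 - 2*alpha)% CI strictly inside (-delta, +delta)?"""
    zcrit = NormalDist().inv_cdf(1.0 - alpha)   # alpha = 0.05 gives a 90% CI
    lo, hi = est - zcrit * se, est + zcrit * se
    return -delta < lo and hi < delta


delta = log(1.25)   # FDA margin on the log scale; equals -log(0.80)
se = 0.08           # hypothetical standard error of the log-ratio estimate

# Hypothetical grid of estimated log-ratios from -0.30 to +0.30.
estimates = [i / 100 for i in range(-30, 31)]
agree = all(
    tost_rejects(e, se, delta) == ci_inside(e, se, delta) for e in estimates
)
print(f"delta = {delta:.4f}; decisions agree on every grid point: {agree}")
```

Note that the interval is the 90% CI when α = 0.05, not the 95% CI: each one-sided test spends α on its own tail.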

What you now know

You can state the ASYMMETRY of classical NHST: failure to reject H₀ is NOT evidence for H₀. "Absence of evidence is not evidence of absence" (Altman & Bland 1995, BMJ). You can name the three classes of research question that REQUIRE supporting the null: bioequivalence (generic = brand), non-inferiority (new ≥ old by a margin), replication / SESOI (effect smaller than what would be meaningful). You can write down the TOST procedure (Schuirmann 1987): two one-sided tests at α each, reject the equivalence H₀ iff both reject; equivalent (Westlake 1976; Berger & Hsu 1996) to: the 100(1 − 2α)% CI for Δ fits inside (−δ, +δ).

You can explain why TOST controls Type-I at α (not 2α) via the intersection-union construction (Berger & Hsu 1996, Theorem 1): at any point of the composite null $|\Delta| \geq \delta$, at least one of the two one-sided nulls is true, and rejecting the equivalence-null requires both one-sided tests to reject, so the rejection probability is bounded by that of the level-α test of whichever one-sided null is true, which is at most α. You can compute the FDA bioequivalence margin: the interval (log 0.80, log 1.25) for the log ratio of AUCs, symmetric on the log scale because $\log(1/0.80) = \log(1.25) \approx 0.223$. You can distinguish equivalence (symmetric: $|\Delta| < \delta$) from non-inferiority (one-sided: $\Delta > -\delta$).
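The α-not-2α claim can be illustrated by simulation. The following is a rough Monte Carlo sketch, not a proof: it assumes a known standard error (a z-test rather than the t-test used in practice), draws the estimated difference directly from a normal distribution, and uses hypothetical values δ = 0.5, σ = 1, n = 100 per group.

```python
import random
from math import sqrt
from statistics import NormalDist


def iut_rejection_rate(true_diff, delta, se, alpha=0.05,
                       n_sims=100_000, seed=0):
    """Fraction of simulated studies in which BOTH one-sided tests reject."""
    rng = random.Random(seed)
    zcrit = NormalDist().inv_cdf(1.0 - alpha)
    rejections = 0
    for _ in range(n_sims):
        d_hat = rng.gauss(true_diff, se)               # simulated estimate
        reject_lower = (d_hat + delta) / se > zcrit    # rejects diff <= -delta
        reject_upper = (d_hat - delta) / se < -zcrit   # rejects diff >= +delta
        if reject_lower and reject_upper:
            rejections += 1
    return rejections / n_sims


delta = 0.5
se = sqrt(2 / 100)   # sigma = 1, n = 100 per group (hypothetical)

# True difference exactly on the boundary of the null: the worst case.
rate_boundary = iut_rejection_rate(delta, delta, se)
# True difference deeper inside the null: false-equivalence rate shrinks.
rate_beyond = iut_rejection_rate(1.5 * delta, delta, se)

print(f"false-equivalence rate at |diff| = delta:     {rate_boundary:.4f}")
print(f"false-equivalence rate at |diff| = 1.5*delta: {rate_beyond:.4f}")
```

At the boundary Δ = +δ, the test against −δ rejects almost always, so the overall rate is governed by the single test against +δ, whose null is true. That is why the two α's do not add.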

You can use the power formula for TOST (Schuirmann 1987): power is maximal at Δ = 0, falls to at most α at the boundary |Δ| = δ (approaching α when the standard error is small relative to δ), and decays further beyond it. You can name the Bayesian alternatives: ROPE-vs-posterior (Kruschke 2018) and Bayes factors for equivalence model vs difference model. You can identify the load-bearing decision (the choice of δ) and the honest caveats: bad δ = useless test; δ MUST be preregistered (Lakens 2017) or the test is post-hoc-hackable; equivalence has its own multiple-comparisons issues; "within margin" is not "exactly equal".
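The shape of the TOST power curve can be sketched with a normal approximation. This is not Schuirmann's exact formula, which works with the t distribution; it assumes a known σ and equal group sizes, and the function name `tost_power` and the numbers below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def tost_power(true_diff, delta, sigma, n_per_group, alpha=0.05):
    """Approximate P(TOST concludes equivalence) under a known-sigma z-test.

    TOST rejects when the estimate lies strictly between
    -delta + zcrit*se and +delta - zcrit*se.
    """
    z = NormalDist()
    se = sigma * sqrt(2.0 / n_per_group)
    zcrit = z.inv_cdf(1.0 - alpha)
    upper = (delta - true_diff) / se - zcrit
    lower = (-delta - true_diff) / se + zcrit
    # If upper < lower the rejection region is empty, so power is zero.
    return max(0.0, z.cdf(upper) - z.cdf(lower))


delta, sigma, n = 0.5, 1.0, 100
for diff in (0.0, 0.25, 0.5, 0.75):
    print(f"true diff = {diff:.2f}: power ~ {tost_power(diff, delta, sigma, n):.4f}")

# With a small sample the margin cannot contain the CI at all.
print(f"n = 10 per group, true diff = 0: power ~ {tost_power(0.0, delta, sigma, 10):.4f}")
```

The last line shows the "cannot fix small n" caveat in numbers: when the CI half-width exceeds δ, no estimate can ever fit inside the margin, so the power is zero even when the true difference is exactly zero.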

You can run the tost-explorer widget and watch the CI slide in and out of the margin band, recognising that GREEN means CI inside the margin (equivalence concluded), RED means CI outside (inequivalence), YELLOW means CI straddles boundary (inconclusive — typically under-powered for the chosen δ). You can run the equivalence-vs-superiority widget and identify the four interpretive quadrants — "small but real", "truly no difference", "large clear effect", "underpowered" — recognising that they are mutually exclusive and that the dangerous UNDERPOWERED corner is the literature's most common misinterpretation.
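The colour rule and the four quadrants can be written down as a small decision function. This is an illustrative sketch, not the widgets' actual code: the function names are hypothetical, the tests use a normal approximation, and the mapping of quadrant labels to (significant, equivalent) pairs is an assumption based on the labels used in the text.

```python
from statistics import NormalDist


def band_colour(est, se, delta, alpha=0.05):
    """Colour of the 100(1 - 2*alpha)% CI relative to the margin band."""
    zcrit = NormalDist().inv_cdf(1.0 - alpha)
    lo, hi = est - zcrit * se, est + zcrit * se
    if -delta < lo and hi < delta:
        return "GREEN"    # CI entirely inside the margin
    if hi <= -delta or lo >= delta:
        return "RED"      # CI entirely outside the margin
    return "YELLOW"       # CI straddles a margin boundary


def quadrant(est, se, delta, alpha=0.05):
    """Cross the two-sided NHST result with the TOST result."""
    z = NormalDist()
    significant = 2.0 * (1.0 - z.cdf(abs(est) / se)) < alpha
    p_tost = max(1.0 - z.cdf((est + delta) / se), z.cdf((est - delta) / se))
    equivalent = p_tost < alpha
    # Assumed mapping of (significant, equivalent) to the text's labels.
    labels = {
        (True, True): "small but real",
        (False, True): "truly no difference",
        (True, False): "large clear effect",
        (False, False): "underpowered",
    }
    return labels[(significant, equivalent)]


# Hypothetical (estimate, standard error) pairs with delta = 0.5.
for est, se in [(0.0, 0.1), (0.25, 0.05), (1.0, 0.1), (0.2, 0.4)]:
    print(est, se, band_colour(est, se, 0.5), quadrant(est, se, 0.5))
```

Because every study lands in exactly one of the four (significant, equivalent) cells, the quadrants are mutually exclusive by construction.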

Where this lands in Part 2. §2.8 integrates §2.4 (p-values), §2.5 (multiple testing), §2.6 (preregistration), and §2.7 (equivalence) into the REPLICATION CRISIS — the literature-scale consequences when these procedural disciplines fail systematically across a generation of researchers.

References

Schuirmann, D.J. (1987). "A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability." Journal of Pharmacokinetics and Biopharmaceutics 15(6), 657–680. (The canonical TOST paper. Introduces the two one-sided tests procedure, proves it controls Type-I at α, derives the power formula, and contrasts with the older "power approach". The basis for FDA bioequivalence.)
Westlake, W.J. (1976). "Symmetrical confidence intervals for bioequivalence trials." Biometrics 32, 741–744. (The CI-based characterisation of equivalence: the equivalence claim is supported iff the 100(1 − 2α)% CI fits inside the margin. Precedes Schuirmann's formal TOST framing.)
Berger, R.L., Hsu, J.C. (1996). "Bioequivalence trials, intersection–union tests, and equivalence confidence sets." Statistical Science 11(4), 283–319. (The formal IUT treatment of TOST: an intersection-union test formed from level-α tests of each piece of a composite null has overall level α, no multiplicity correction. The mathematically clean foundation for TOST.)
Lakens, D. (2017). "Equivalence tests: a practical primer for t tests, correlations, and meta-analyses." Social Psychological and Personality Science 8(4), 355–362. (The practical primer for working researchers: how to choose δ, how to preregister it, how to run TOST in the standard statistical software, with worked examples for t-tests, correlations, and meta-analyses.)
FDA (2003). Guidance for Industry: Bioavailability and Bioequivalence Studies for Orally Administered Drug Products — General Considerations. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research. (The regulatory bible for bioequivalence trials: the log(0.80, 1.25) margin on log-AUC, the 90% CI rule, the crossover design conventions, the criteria for highly variable drugs.)
Wellek, S. (2010). Testing Statistical Hypotheses of Equivalence and Noninferiority (2nd ed.). Chapman & Hall / CRC Press. (The book-length treatment: equivalence and non-inferiority machinery for t-tests, ANOVA, regression, binomial proportions, survival, with full mathematical formalism and worked examples.)
Senn, S. (2007). Statistical Issues in Drug Development (2nd ed.). Wiley. (The drug-development practical reference: chapters 12 and 22 cover bioequivalence and non-inferiority in the regulatory context, with extensive worked examples and the operational subtleties.)
Walker, E., Nowacki, A.S. (2011). "Understanding equivalence and noninferiority testing." Journal of General Internal Medicine 26(2), 192–196. (The clinician-targeted overview: distinguishes equivalence from non-inferiority, explains the CI interpretation, walks through worked clinical examples. Accessible for medical readers without statistical training.)
Altman, D.G., Bland, J.M. (1995). "Absence of evidence is not evidence of absence." BMJ 311(7003), 485. (The canonical short editorial naming the NHST asymmetry. One page, hugely influential. The phrase has become a stock reminder in clinical statistics textbooks.)
Hoekstra, R., Finch, S., Kiers, H.A.L., Johnson, A. (2006). "Probability as certainty: dichotomous thinking and the misuse of p values." Psychonomic Bulletin & Review 13(6), 1033–1037. (The empirical demonstration that 78% of psychology papers conflate "p > 0.05" with "no effect". The motivating literature for the equivalence-testing reform movement.)
Piaggio, G., Elbourne, D.R., Pocock, S.J., Evans, S.J.W., Altman, D.G., CONSORT Group (2012). "Reporting of noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement." JAMA 308(24), 2594–2604. (The CONSORT extension for non-inferiority and equivalence trials: required reporting items, the hierarchical-testing convention for combined NI+superiority claims.)
Kruschke, J.K. (2018). "Rejecting or accepting parameter values in Bayesian estimation." Advances in Methods and Practices in Psychological Science 1(2), 270–280. (The ROPE-vs-posterior framework: the Bayesian analogue of TOST, with the 95% HDI replacing the 90% CI and the ROPE replacing the equivalence margin.)
Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., Selker, R., Gronau, Q.F., Šmíra, M., Epskamp, S., Matzke, D., Rouder, J.N., Morey, R.D. (2018). "Bayesian inference for psychology. Part I: theoretical advantages and practical ramifications." Psychonomic Bulletin & Review 25(1), 35–57. (The Bayes-factor framework for equivalence vs difference models, with the prior conventions and the operational interpretation of BF values.)