Multiple testing: FWER and FDR
Learning objectives
- Explain why testing k > 1 hypotheses at level α each gives an EFFECTIVE Type-I rate above α: under independence, the chance of ANY false positive is , which exceeds α and grows in k
- Define the Family-Wise Error Rate (FWER) as where V is the count of false positives among the k tests; "family-wise" because it is a global error rate over the FAMILY of tests reported as one decision unit
- State BONFERRONI's correction: test each at ; this controls regardless of dependence among the tests, by Boole's inequality (the union bound)
- State ŠIDÁK's correction (small α) as the EXACT FWER cutoff under INDEPENDENCE; slightly tighter than Bonferroni in that case, but rarely used in practice because the independence requirement is hard to verify
- State HOLM-BONFERRONI (1979) as a sequential STEP-DOWN procedure: sort p-values ; reject for the smallest such that , then continue until one fails; STRICTLY MORE POWERFUL than Bonferroni at the same FWER ≤ α
- Define the False Discovery Rate as where R = V + S is the total number of rejections; the rate is the expected FRACTION of rejections that are wrong, NOT the probability of any wrong rejection — a fundamentally different target than FWER
- State the BENJAMINI-HOCHBERG (1995) procedure: sort ; find the LARGEST such that ; reject ; this STEP-UP procedure controls under independence or positive regression dependence (Benjamini-Yekutieli 2001)
- Compare FWER vs FDR control by use case: FWER is appropriate when ANY false positive is catastrophic (confirmatory drug trials, fundamental-discovery claims); FDR is the modern default in high-throughput settings (genomics, neuroimaging) where many discoveries are expected and a controlled FRACTION of false positives is the right target
- Recognise BH's dependence caveat: BH (1995) is valid under independence and positive regression dependence (PRDS); for ARBITRARY dependence, use BENJAMINI-YEKUTIELI (2001), which scales the line by — more conservative
- Identify advanced multiple-testing constructs from the literature: q-VALUES (Storey 2002) as a per-hypothesis FDR analogue of p-values; LOCAL FDR (Efron 2010) for empirical-Bayes contexts; HIERARCHICAL / GATEKEEPING procedures for multi-endpoint clinical trials; RESAMPLING-BASED corrections (Westfall-Young 1993) for correlated tests
- Diagnose when corrections matter: with k small, naive testing already inflates Type-I noticeably; with k = 20,000 (a genome scan), Bonferroni demands for ANY rejection, which is why FDR is the genomics default; pre-specify ALL tests in the analysis plan and report which correction was applied
§2.1 framed hypothesis testing as a single Neyman-Pearson decision with rates . §2.2 turned those rates into research-planning numbers. §2.3 walked the t, χ², F mechanics that produce p-values, and §2.4 said precisely what those p-values are: tail probabilities under , no more, no less. §2.4 also previewed one of the worst failure modes of careless p-value use — multi-outcome fishing — where testing outcomes at α each silently inflates the effective Type-I rate from α to . §2.5 is the formal-fix section: how do you test many hypotheses while STILL controlling a sensible global error rate?
The §2.5 arc has six stops. First, the CORE PROBLEM — k tests at α each yield expected false positives under , and the chance of AT LEAST ONE false positive grows toward 1. Second, the FAMILY-WISE ERROR RATE (FWER) — the probability of ANY false positive — and its three classical controllers: BONFERRONI (test at α/k, by the union bound), ŠIDÁK (the exact independence cutoff), and HOLM-BONFERRONI (sequential, strictly more powerful). Third, the FALSE DISCOVERY RATE (FDR) — the expected FRACTION of rejections that are wrong — and the BENJAMINI-HOCHBERG 1995 step-up procedure, whose modern dominance in genomics changed how everyone thinks about high-throughput testing. Fourth, the WHEN-WHICH question: FWER for confirmatory and high-stakes work, FDR for exploration in many-test settings. Fifth, the HONEST CAVEATS: BH assumes independence or positive dependence; for arbitrary dependence, Benjamini-Yekutieli is the conservative alternative. Sixth, MODERN EXTENSIONS — q-values, local FDR, gatekeeping, resampling — pointers for further study.
One framing note before the formulas. FWER and FDR are DIFFERENT quantities; they answer different questions. FWER asks "what is the probability that I make at least one false rejection?" — a yes/no question about the family as a whole. FDR asks "of the things I reject, what fraction is expected to be wrong?" — a continuous-quality question about the rejection set. Picking between them is a STATEMENT ABOUT THE COST OF FALSE POSITIVES: if even one false positive ruins your analysis (an FDA approval, a Nobel-claim discovery), pick FWER. If you expect many real effects and tolerate a small bounded contamination (a genome scan finding 200 hits, 10 of which may be false), pick FDR. There is no universally right answer; there is a right answer per problem.
The core problem: testing more inflates Type-I
Start with the cleanest model. Suppose you run INDEPENDENT tests, each at level α, and the global null holds (every individual is true). Each test's p-value is Uniform[0, 1] under its own (§2.4 fact), so the rejection indicator is Bernoulli(α). The count of false positives . Two consequences:
- Expected false positives: . With and , — you expect ONE false positive when ALL nulls are true. With , you expect five.
- Probability of at least one false positive: . For : . For : . By you are almost certain to declare at least one bogus discovery.
The genomics-era version of this calculation is what cemented FDR control as the practical default. A standard differential-expression scan tests genes. At nominal α = 0.05 with no correction: false positives even if NO gene is differentially expressed. A scientist wanting to publish needs to control this — but the FWER correction Bonferroni demands per-gene α = , and at that threshold the test is so conservative that real moderate-effect genes are missed. Hence Benjamini-Hochberg: tolerate a controlled fraction of false positives in exchange for many more true positives.
FWER control: Bonferroni, Šidák, Holm
BONFERRONI. The simplest, most general, and weakest FWER controller. Reject if . By the union bound (Boole's inequality),
Crucially, the union bound does NOT require independence — Bonferroni controls FWER under ANY dependence structure. The cost is power: at , each test must clear ; at , . Bonferroni is the right tool when the family is small () or when control under unknown dependence is non-negotiable (e.g., the primary endpoints of a regulatory clinical trial).
ŠIDÁK. Under INDEPENDENCE, the exact per-test cutoff is . By construction , so FWER = α exactly. For small α and moderate k, — barely tighter than Bonferroni. Šidák (1967) is mathematically correct under independence and slightly more permissive, but practitioners rarely use it because verifying independence in real test families is hard, while Bonferroni is dependence-free.
HOLM-BONFERRONI (1979). A sequential STEP-DOWN procedure that is STRICTLY MORE POWERFUL than Bonferroni at the same FWER ≤ α. The algorithm:
- Sort the p-values ascending: .
- For : if , reject and continue.
- At the first where the condition fails, STOP. Reject ; do not reject .
The first comparison is to (same as Bonferroni for the smallest p). The SECOND is to — more permissive. The third to . Holm is strictly more powerful: every p-value Bonferroni rejects, Holm also rejects (because the first comparison is identical), and Holm rejects MORE because the later thresholds are easier. FWER ≤ α follows from a closed-testing argument (Marcus, Peritz, Gabbai 1976; Holm 1979). Holm controls FWER under ARBITRARY dependence — same generality as Bonferroni, strictly better power. The §2.5 verdict: use Holm-Bonferroni in place of Bonferroni whenever FWER control is the goal.
FDR control: the Benjamini-Hochberg procedure
FWER is conservative by design: it controls the probability of ANY false positive, treating the family as a single bet. In settings where you EXPECT many discoveries, that is wasteful — you would rather control the fraction of false positives among rejections. Benjamini & Hochberg (1995) introduced the FALSE DISCOVERY RATE and the procedure that controls it. Let = number of false rejections, = number of true rejections, = total rejections. The FDR is
The in the denominator is the BH 1995 convention: when (no rejections), the ratio is 0/1 = 0 (no false discoveries, vacuously). The expectation is over the data-generating process at the truth.
The BH 1995 step-up procedure. Given p-values and target FDR α:
- Sort ascending: .
- Find the LARGEST such that .
- If such exists, reject . Otherwise, reject nothing.
Two important features of BH. First, it is a STEP-UP procedure (you walk from the LARGEST p downward and stop at the first one that satisfies the bound). Second, the rejection set is everything to the LEFT of — including p-values whose own at their own rank does NOT satisfy . The procedure rejects based on the CUMULATIVE pattern of small p's, not the per-rank one.
Benjamini & Hochberg (1995, JRSSB) proved that under INDEPENDENCE of the test statistics, this procedure controls , where is the unknown number of true nulls. Benjamini & Yekutieli (2001, Annals of Statistics) extended the result to POSITIVE REGRESSION DEPENDENCE ON SUBSETS (PRDS), which covers most realistic correlated-tests settings (one-sided tests on multivariate-normal statistics with non-negative covariances, for instance). For ARBITRARY dependence, BY 2001 showed BH may inflate FDR to up to — at , that is about . To control FDR under arbitrary dependence, use the BY-corrected line .
The first widget simulates the four strategies on the SAME families of tests with a controllable fraction of true alternatives at effect size , and reports the empirical FWER, FDR, and power for each method.
Things to verify in the widget:
- Set (global null), , , irrelevant. The Naive empirical FWER should land around — nearly every family has at least one false positive. Bonferroni/Holm/BH all sit at or below 5% — by design. With , FDR = FWER for every method, because every rejection is by definition false.
- Set (10% true alternatives), , . Compare BH's power to Holm's — BH should be higher, often 10-30 percentage points more, because BH tolerates a few false positives in exchange for many more true ones. Meanwhile, BH's empirical FDR should still hover at or below α.
- Slide from 10 to 1000 with fixed. Bonferroni's power collapses (the per-test threshold becomes tiny); Holm matches Bonferroni's shape but is uniformly better; BH's power is the most robust to growing k because its critical line scales with rank, not with k alone.
- Set (a "rich" world where half the hypotheses are true alts) at moderate . BH may reject substantially more than Holm here — this is the regime where FDR's "controlled fraction" really pays. The naive method's FWER is still much higher than its FDR, because the FP count is small relative to the much larger TP count, so the rejection set is mostly correct.
- Re-roll a few times. Monte Carlo noise at is around ±1 percentage point on the rates. Push to 5000 sims for tighter estimates if you need them.
The BH procedure in pictures
The classical BH visualisation is to plot the sorted p-values against rank , overlaid with the BH critical line . The Benjamini-Hochberg 1995 paper Figure 1 introduced this plot, and it remains the cleanest way to see what the procedure actually does. The second widget shows this picture on a concrete family — you can edit any rank's p-value with a slider (the family stays sorted) and watch the rejection set update.
Things to verify:
- Load the "Strong-effect mix" preset at α = 0.05. The smallest 5 p-values lie well below the BH diagonal; the largest 15 lie far above. BH's should be 5 — every rank from 1 to 5 is rejected. The Bonferroni horizontal line at α/k = 0.0025 catches the smallest 3 or 4 only — BH wins the EXTRA two on this family.
- Load the "All-null" preset at α = 0.05. The smallest p is around 0.024 — above α/20 = 0.0025 (so Bonferroni rejects nothing) and also above 1/20 · 0.05 = 0.0025 (so BH rejects nothing either). Both methods correctly stay quiet.
- Load the "Borderline" preset. The smallest p is 0.001 (Bonferroni rejects), but the rest taper smoothly. Watch BH walk from rank 20 leftward — the first rank where is around or 4 (depending on α). BH rejects everything to its left, including ranks whose individual p is ABOVE its own critical (this is the step-up property).
- Slide α from 0.05 down to 0.005 with any preset loaded. The BH diagonal becomes flatter; fewer ranks cross below it. At α = 0.001, even strong-effect families lose most rejections — the tradeoff between α and power is just as direct under FDR control as under raw Neyman-Pearson.
- Drag an individual p_(j) slider for a single rank with the "Strong" preset loaded. Notice the table re-sorts on every drag — the sliders are labelled by RANK, not by hypothesis identity. This is the right mental model: BH only sees the ordered p-vector.
FWER vs FDR: how to choose
The choice is a STATEMENT ABOUT THE COST OF FALSE POSITIVES. A short table of guidance:
- Use FWER (Bonferroni or Holm) when: the family is small (); a single false positive is catastrophic (drug approval primary endpoint, fundamental-discovery claim, safety-monitoring signal); regulatory or pre-registered confirmatory analysis; the audience expects a binary decision per hypothesis with strong protection against any error.
- Use FDR (Benjamini-Hochberg) when: the family is large (); discoveries are expected and false positives are tolerable as long as their FRACTION is bounded (gene-expression scans, neuroimaging voxel-wise tests, candidate-feature ranking in machine learning); the analysis is exploratory or hypothesis-generating; the audience wants a ranked list of discoveries with a controlled contamination rate.
- Use Benjamini-Yekutieli FDR when: the test statistics have ARBITRARY dependence (e.g., correlated time-series tests with negative correlations); BH may not control FDR in this regime, BY does at the cost of a shrinkage of the line.
- Use gatekeeping / hierarchical procedures when: the family has structure (primary endpoints first, then secondary; one experimental arm gates further analyses on follow-up arms); these procedures preserve FWER while exploiting the structure for power.
One non-controversy: WHATEVER you use, REPORT it. The ASA 2016 principle #4 (full transparency) demands disclosure of all tests run and all corrections applied. A modern paper that reports "p = 0.03 after BH correction across 12 outcomes" is interpretable; a paper that reports "p = 0.03" with no description of the rest of the test family is not.
Beyond the basics: q-values, local FDR, resampling
For a complete pointer to the modern multiple-testing literature, three constructs are worth recognising even if you do not deploy them every day.
q-values (Storey 2002). The BH-rejection threshold depends on . Storey turned this around: for each test, define the -value as the SMALLEST FDR target α at which BH would still reject this hypothesis. The q-value is the FDR analogue of the p-value — interpretable per-hypothesis without picking α in advance. Storey's q-value also estimates (the proportion of true nulls) from the p-value distribution, giving a slightly more powerful procedure than BH 1995 in many practical contexts. The R package qvalue is the canonical implementation.
Local FDR (Efron 2010). BH controls the AVERAGE false-discovery rate across the rejection set; the LOCAL FDR asks instead "for THIS specific p-value, what is the posterior probability that the corresponding null is true?" — a Bayesian-flavoured quantity that ranks rejections by per-feature credibility. Efron's Large-Scale Inference (2010) is the standard reference; the empirical-Bayes machinery there underpins much of modern genomics analysis.
Resampling-based corrections (Westfall-Young 1993). When the test statistics are heavily correlated, neither Bonferroni nor BH exploits the correlation structure. Westfall-Young's permutation/bootstrap procedures use the data's own dependence structure to compute exact (or near-exact) FWER cutoffs. The cost is computational — you re-simulate the test family hundreds or thousands of times. The benefit is potentially much higher power when correlations are strong. The R package multtest implements Westfall-Young step-down.
Gatekeeping procedures (Bauer 1991; Dmitrienko, Tamhane, Bretz 2009). Clinical trials with primary, secondary, and tertiary endpoints use hierarchical testing strategies that exploit the endpoint ordering. The simplest form: test the primary at α; only if it rejects, test the secondaries. More elaborate versions distribute α across endpoints and recycle the "saved" α from rejected primaries. The Hochberg-Tamhane (1987) Multiple Comparison Procedures is the standard reference.
Try it
- In the multitest-explorer widget, set k = 100, π₁ = 0, α = 0.05, families = 2000. Verify Naive empirical FWER ≈ . Bonferroni, Holm, BH should all be near 5%. Re-roll; the bars dance around within ±1 pp of those values.
- Same widget, π₁ = 0.10, δ = 2σ, k = 100. Compare Bonferroni power vs Holm power vs BH power. Holm should be 0-5 pp higher than Bonferroni (Holm is strictly more powerful than Bonferroni); BH should be 15-30 pp higher than Holm. Sanity-check: BH's FDR should sit near α; Holm's and Bonferroni's FDR should be well below α (they are conservative for FDR because they target FWER).
- Same widget, slide k from 10 → 1000 at fixed π₁ = 0.10, δ = 2σ. Watch Bonferroni's power crater (its per-test threshold α/k shrinks linearly in k). Holm's power degrades almost as fast (the first rank still has threshold α/k). BH's power decays much more slowly because its critical line scales with rank.
- In the bh-procedure-step widget, load the "Strong" preset at α = 0.05. Identify j* in the table — it should be 5 (the smallest five p-values clear the BH diagonal). The Bonferroni rejection set (last column) is ranks 1 and 2 only (p_(1) = 0.0002 and p_(2) = 0.0009 are below α/k = 0.0025; the rest exceed it). The three additional BH rejections (ranks 3, 4, 5) are the FDR-vs-FWER bargain.
- Same widget, "Borderline" preset at α = 0.05. Walk through the BH algorithm by hand on the table: for j = 20, is p_(20) ≤ (20/20)·0.05 = 0.05? Probably not. For j = 19, is p_(19) ≤ 0.0475? Continue downward until the first YES — that's j*. Verify the widget agrees.
- Same widget, "Dense" preset at α = 0.05. This mimics a genome-scan family with many small p-values. j* should be large (~12-15) — BH rejects most of the family, while Bonferroni catches only the very smallest 3-4. The gap is the visceral demonstration of why FDR became the genomics default.
- Pen-and-paper. A clinical trial reports six secondary-endpoint p-values: 0.001, 0.008, 0.019, 0.035, 0.041, 0.062. Apply Bonferroni at α = 0.05: reject only those below 0.05/6 = 0.00833 — that is the first TWO (0.001 and 0.008, both below 0.00833). Apply Holm: compare 0.001 to 0.05/6 (yes), 0.008 to 0.05/5 = 0.01 (yes), 0.019 to 0.05/4 = 0.0125 (no, stop). Holm rejects the first two — same as Bonferroni here. Apply BH at α = 0.05: largest j with p_(j) ≤ (j/6) · 0.05 — check j=6: 0.062 ≤ 0.05 (no); j=5: 0.041 ≤ 0.0417 (YES). BH rejects all of ranks 1-5. The three methods give three different rejection sets on the same data; the conclusions about which secondaries "matter" depend on which correction you chose.
- Pen-and-paper. A genomics study tests k = 20,000 genes. At Bonferroni α = 0.05 per family, the per-gene threshold is 0.05/20000 = 2.5 × 10⁻⁶. If true alternatives produce effects requiring p ≈ 0.001 for detection, Bonferroni misses essentially all of them. At BH α = 0.05, suppose the empirical p-distribution shows 500 genes with p < 0.01. Roughly, BH rejects ~all of them with FDR ≈ 5%, expected false rejections ≈ 25. Discuss: in what kind of biology study is "discover 475 true effects, accept ~25 false ones" the right tradeoff? In what study would you reject this in favour of Bonferroni's "discover 0 effects, certify 0 false rejections"?
- Pen-and-paper. For arbitrary dependence among the tests, BH 1995 may inflate FDR up to . Compute the inflation factor at k = 10, 100, 1000. (10: 2.93; 100: 5.19; 1000: 7.49.) For control at level α under arbitrary dependence (Benjamini-Yekutieli 2001), use the line instead. At k = 100, that's — the BY threshold is about 5x stricter than BH. Discuss when this trade is worth it.
- Pen-and-paper. Suppose you ran k = 5 outcome tests, got p-values 0.012, 0.024, 0.041, 0.06, 0.08, and now want to report which are "significant after correction at α = 0.05". Compute the rejection set under (a) Bonferroni, (b) Holm, (c) BH, (d) Šidák. (a) Bonf: 0.05/5 = 0.01; nothing rejected. (b) Holm: p_(1) = 0.012 vs 0.01 — fail; nothing rejected. (c) BH: largest j with p_(j) ≤ j·0.01 — j=5: 0.08 ≤ 0.05 no; j=4: 0.06 ≤ 0.04 no; j=3: 0.041 ≤ 0.03 no; j=2: 0.024 ≤ 0.02 no; j=1: 0.012 ≤ 0.01 no. NOTHING rejected. (d) Šidák: per-test α = 1 - 0.95^(1/5) ≈ 0.0102; the smallest p 0.012 > 0.0102 — nothing rejected. All four methods agree HERE — the smallest p simply isn't small enough. The differences only emerge for stronger-signal families.
Pause and reflect: §2.5 has formalised the multiple-testing problem and given you two distinct global error rates to control. FWER (Bonferroni, Holm) is the right target when ANY false positive is catastrophic; FDR (Benjamini-Hochberg) is the right target when discoveries are expected and false positives are tolerable as long as their FRACTION is bounded. The four classical procedures — Bonferroni (general but conservative), Šidák (exact under independence, rarely worth the assumption), Holm-Bonferroni (strictly better than Bonferroni at no cost), Benjamini-Hochberg (the modern default for high-throughput) — span the practical space. Beyond them, q-values, local FDR, resampling-based corrections, and gatekeeping all live in the same procedural toolkit, each appropriate to a particular dependence structure or family geometry. The single procedural commitment §2.5 makes: PRE-SPECIFY the family of tests in your analysis plan, REPORT the correction you applied, and DO NOT add tests after seeing the data. §2.6 will work out the preregistration mechanics that make this commitment concrete.
What you now know
You can quantify the multiple-testing problem: under independence with k tests at α each, FWER = — at k = 20, α = 0.05, that's ~64%; at k = 100, ~99%. The naive procedure stops being a 5%-Type-I procedure the moment you test more than one thing.
You can derive and apply the THREE classical FWER controllers. BONFERRONI: test at α/k, controls FWER under arbitrary dependence by the union bound, weakest in power. ŠIDÁK: per-test α = , exact FWER under independence, marginally better than Bonferroni. HOLM-BONFERRONI 1979: sort p's, reject p_(j) if p_(j) < α/(k-j+1), stop at the first failure; strictly more powerful than Bonferroni, same FWER ≤ α, same generality. The §2.5 verdict: use Holm in place of Bonferroni whenever FWER control is the goal.
You can derive and apply the BENJAMINI-HOCHBERG 1995 procedure for FDR control. Sort ascending; find the largest j* with p_(j*) ≤ (j*/k)·α; reject all of ranks 1 through j*. Controls under independence and PRDS dependence. Under arbitrary dependence, use BY 2001 with the line shrunk by . BH is the genomics default because Bonferroni-correcting 20,000 tests at α = 0.05 demands per-gene p < 2.5e-6 — too strict to detect anything but the strongest effects.
You can choose FWER vs FDR by the cost of false positives: FWER when any one false positive is catastrophic (confirmatory drug trials, fundamental-discovery claims, pre-registered primary endpoints); FDR when many discoveries are expected and a controlled FRACTION of false positives is acceptable (genomics, neuroimaging, feature ranking). The choice is a statement about the loss function, not about which method is "better".
You can recognise the modern extensions: q-values (Storey 2002) give per-hypothesis FDR analogues; local FDR (Efron 2010) gives posterior-flavoured per-feature credibility; resampling-based corrections (Westfall-Young 1993) exploit correlation structure for higher power; gatekeeping procedures (Hochberg-Tamhane 1987, Bauer 1991) exploit endpoint hierarchies. Each lives in the same conceptual space and each has the same procedural commitment: pre-specify the family, report the correction.
Where this lands in Part 2. §2.6 will work out the pre-registration mechanics that make the "pre-specify the family" commitment operational — what goes in the analysis plan, how to register it, what counts as a deviation. §2.7 covers EQUIVALENCE TESTING (TOST, Schuirmann 1987) — the right tool for supporting "no meaningful effect" when a non-significant frequentist test cannot. §2.8 the REPLICATION CRISIS — what happens to a literature when these structural failures (single-test misinterpretation, p-hacking, multiple-testing inflation, post-hoc analysis) compound across studies.
References
- Bonferroni, C.E. (1936). "Teoria statistica delle classi e calcolo delle probabilità." Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8, 3–62. (The origin of the union-bound multiple-testing correction. Bonferroni's actual inequalities (the first and second Bonferroni inequalities) generalize Boole's inequality and underpin the α/k correction.)
- Šidák, Z. (1967). "Rectangular confidence regions for the means of multivariate normal distributions." Journal of the American Statistical Association 62(318), 626–633. (Šidák's exact-under-independence correction .)
- Holm, S. (1979). "A simple sequentially rejective multiple test procedure." Scandinavian Journal of Statistics 6(2), 65–70. (Holm-Bonferroni: the step-down procedure that is strictly more powerful than Bonferroni under any dependence structure.)
- Benjamini, Y., Hochberg, Y. (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing." Journal of the Royal Statistical Society Series B 57(1), 289–300. (The foundational FDR paper; the BH step-up procedure; the proof that BH controls FDR under independence.)
- Benjamini, Y., Yekutieli, D. (2001). "The control of the false discovery rate in multiple testing under dependency." Annals of Statistics 29(4), 1165–1188. (BY 2001: extends BH to positive regression dependence on subsets (PRDS); for arbitrary dependence, derives the more conservative BY procedure.)
- Storey, J.D. (2002). "A direct approach to false discovery rates." Journal of the Royal Statistical Society Series B 64(3), 479–498. (Storey's q-value: the per-hypothesis FDR analogue of the p-value; also estimates to sharpen BH.)
- Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press. (The canonical modern treatment of local-FDR and empirical-Bayes methods for high-throughput testing; underpins much of modern genomics analysis.)
- Westfall, P.H., Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley. (Permutation- and bootstrap-based step-down procedures that exploit the dependence structure of the test statistics for power gains over Bonferroni/Holm.)
- Hochberg, Y., Tamhane, A.C. (1987). Multiple Comparison Procedures. Wiley. (The standard reference for FWER procedures, including closed-testing, Hochberg's step-up (1988), and gatekeeping procedures for hierarchical hypothesis families.)
- Marcus, R., Peritz, E., Gabbai, K.R. (1976). "On closed testing procedures with special reference to ordered analysis of variance." Biometrika 63(3), 655–660. (Closed-testing principle: the framework that proves Holm-Bonferroni controls FWER and underpins most modern step-down procedures.)
- Benjamini, Y., Hochberg, Y. (2000). "On the adaptive control of the false discovery rate in multiple testing with independent statistics." Journal of Educational and Behavioral Statistics 25(1), 60–83. (Adaptive BH that estimates from the data for sharper power; precursor of Storey's q-value.)
- Bauer, P. (1991). "Multiple testing in clinical trials." Statistics in Medicine 10(6), 871–890. (Gatekeeping procedures for primary/secondary endpoint hierarchies in regulatory clinical trials.)
- Hommel, G. (1988). "A stagewise rejective multiple test procedure based on a modified Bonferroni test." Biometrika 75(2), 383–386. (Hommel's procedure: a refinement of Holm-Bonferroni that is uniformly more powerful under positive dependence among the test statistics.)
- Hochberg, Y. (1988). "A sharper Bonferroni procedure for multiple tests of significance." Biometrika 75(4), 800–802. (Hochberg's step-up FWER procedure: a step-up analogue of Holm, valid under independence and positive dependence.)
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 10 §10.7 is a clean one-section overview of the multiple-testing problem and BH FDR for a stats audience.)