Capstone 1 — a designed RCT, end to end
Learning objectives
- Execute a complete RCT workflow: power analysis → preregistration → data simulation → confirmatory analysis → robustness checks → write-up
- Compute the sample size REQUIRED to detect a target effect at chosen α and power
- Run a confirmatory t-test, report effect estimate + CI, and apply Bonferroni correction for K outcomes
- Perform ROBUSTNESS checks: variance changes, outliers, distribution-shape sensitivity, intent-to-treat vs per-protocol
- Write a brief preregistration-compatible result section that distinguishes confirmatory from exploratory findings
This capstone walks through a complete RCT from study design through publication, applying Parts 1–4 (estimation, hypothesis testing, CIs, regression) and Part 9 (ML for researchers). The scenario: a hypothetical educational study tests whether a new instructional method improves standardised test scores. The capstone covers six steps: power analysis, preregistration, confirmatory analysis (on simulated data), robustness checks, effect-size communication, and write-up. The integrated widget below runs steps 1–3; the prose covers steps 4–6.
Step 1 — Power analysis BEFORE data collection
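An RCT with insufficient power is wasteful: it cannot reliably detect the effect it was designed to test. Power analysis sets the required sample size per group:
$$n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}$$
where $d$ is the target effect size (Cohen's $d$), $\alpha$ the significance level, $1-\beta$ the desired power, and $z_q$ the $q$-quantile of the standard normal distribution. For our hypothetical study: a medium target effect $d$, $\alpha = 0.05$ two-sided, power $1-\beta = 0.80$. The widget computes $n$ per group as the required sample size.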
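The sample-size calculation can be sketched in a few lines. This is a minimal illustration using the common normal approximation for a two-sided, two-sample comparison, n = 2(z_{1-α/2} + z_{1-β})² / d² per group; the function names `required_n_per_group` and `approx_power` are hypothetical, and the widget may use an exact t-based calculation instead.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def required_n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group n for a two-sided, two-sample comparison (normal approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = _STD_NORMAL.inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


def approx_power(n_per_group: int, d: float, alpha: float = 0.05) -> float:
    """Approximate two-sided power with n subjects per group and true effect d."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    ncp = d * math.sqrt(n_per_group / 2)  # expected z-statistic under H1
    # Probability of landing in either rejection region
    return _STD_NORMAL.cdf(ncp - z_alpha) + _STD_NORMAL.cdf(-ncp - z_alpha)


if __name__ == "__main__":
    for d in (0.2, 0.4, 0.5):
        print(f"d = {d}: n per group = {required_n_per_group(d)}")
    print(f"power at n = 80, d = 0.40: {approx_power(80, 0.40):.3f}")
```

For d = 0.40 the approximation gives 99 per group, a shade under what an exact t-based calculation returns, so small differences from a t-based readout are expected.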
Step 2 — Preregistration (analysis lock)
BEFORE data collection, pre-register the analysis plan: primary hypothesis (two-sided alternative $H_1: \mu_T \neq \mu_C$, matching the two-sided test assumed in the power analysis), primary outcome (standardised test score), inclusion / exclusion criteria, primary analysis (independent two-sample t-test, equal variances), secondary outcomes (with multiple-testing correction), and interim analyses with stopping rules (none planned). Once locked, post-hoc deviations must be reported as exploratory.
Pre-registration prevents p-hacking, the garden of forking paths, and confirmation bias. It also makes the result reproducible: the analysis is determined by the pre-registration, not by the data.
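One way to make the "analysis lock" concrete is to write the plan down as an immutable object before any data exist and to label every later analysis against it. The sketch below is only an illustration: the class `AnalysisPlan`, the helper `label_analysis`, and the secondary-outcome names are hypothetical and not part of any preregistration platform.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class AnalysisPlan:
    primary_outcome: str
    primary_test: str
    alpha: float
    secondary_outcomes: tuple
    interim_analyses: int = 0


def label_analysis(plan: AnalysisPlan, outcome: str, test: str) -> str:
    """Call an analysis confirmatory only if it matches the locked plan."""
    if outcome == plan.primary_outcome and test == plan.primary_test:
        return "confirmatory (primary)"
    if outcome in plan.secondary_outcomes and test == plan.primary_test:
        return "confirmatory (secondary, multiplicity-corrected)"
    return "exploratory"


# Hypothetical plan for the teaching-method study
PLAN = AnalysisPlan(
    primary_outcome="standardised_test_score",
    primary_test="two-sample t-test, equal variances",
    alpha=0.05,
    secondary_outcomes=("attendance", "homework_completion"),  # illustrative names
)

if __name__ == "__main__":
    print(label_analysis(PLAN, "standardised_test_score", PLAN.primary_test))
    print(label_analysis(PLAN, "standardised_test_score", "subgroup t-test by age"))
```

Anything that does not match the locked plan falls through to "exploratory", which mirrors the rule that post-hoc deviations are reported as such.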
Step 3 — Confirmatory analysis
Run the pre-registered analysis on the collected (or simulated) data. The widget shows: histograms of control and treatment groups, sample means $\bar{x}_C$ and $\bar{x}_T$, t-statistic, p-value, and 95% CI for the mean difference $\Delta = \bar{x}_T - \bar{x}_C$. If $p < \alpha$, reject $H_0$; if $p \geq \alpha$, fail to reject $H_0$ (but DO NOT claim "no effect": a non-significant result is consistent with a range of effect sizes).
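The confirmatory step can be sketched with NumPy and SciPy (both assumed to be installed). The helper name `confirmatory_ttest`, the simulated scores, and the seed are illustrative assumptions, not the widget's implementation; the test itself is the equal-variance two-sample t-test named in the plan, with the CI built from the pooled standard error.

```python
import numpy as np
from scipy import stats


def confirmatory_ttest(control, treatment, alpha=0.05):
    """Equal-variance two-sample t-test plus a CI for the mean difference."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    n_c, n_t = control.size, treatment.size
    diff = treatment.mean() - control.mean()

    # Pooled variance and standard error of the difference
    df = n_c + n_t - 2
    pooled_var = ((n_c - 1) * control.var(ddof=1)
                  + (n_t - 1) * treatment.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1 / n_c + 1 / n_t))

    result = stats.ttest_ind(treatment, control, equal_var=True)
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    return {
        "diff": float(diff),
        "t": float(result.statistic),
        "p": float(result.pvalue),
        "ci": (float(diff - t_crit * se), float(diff + t_crit * se)),
        "reject_h0": bool(result.pvalue < alpha),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(seed=1)  # arbitrary seed for the simulation
    control = rng.normal(loc=0.0, scale=1.0, size=80)
    treatment = rng.normal(loc=0.4, scale=1.0, size=80)  # true d = 0.40
    print(confirmatory_ttest(control, treatment))
```

Because the CI and the p-value come from the same t distribution, the interval excludes zero exactly when the test rejects at the same α.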
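For multiple primary or secondary outcomes (K tests), apply Bonferroni correction: declare significance only when $p < \alpha/K$. The widget's Step 3 panel shows Monte Carlo family-wise error rates (FWER): the uncorrected rate rises with K as $1 - (1-\alpha)^K$ for independent tests (at most $K\alpha$, and well above the target), while the Bonferroni-corrected rate stays at or just below α.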
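The family-wise error rates can be checked both analytically and by simulation. The sketch below assumes K independent tests with every null hypothesis true, so each p-value is Uniform(0, 1); the function names are hypothetical and the simulation is a stand-in for whatever the widget runs internally.

```python
import random


def fwer_analytic(k: int, alpha: float = 0.05, bonferroni: bool = False) -> float:
    """FWER for k independent tests when every null hypothesis is true."""
    per_test = alpha / k if bonferroni else alpha
    return 1 - (1 - per_test) ** k


def fwer_monte_carlo(k, alpha=0.05, bonferroni=False, n_sims=20_000, seed=0):
    """Estimate the FWER by drawing p-values, which are Uniform(0, 1) under H0."""
    rng = random.Random(seed)
    per_test = alpha / k if bonferroni else alpha
    hits = 0
    for _ in range(n_sims):
        # A simulated study errs if any of its k tests falls below the threshold
        if any(rng.random() < per_test for _ in range(k)):
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k,
              round(fwer_analytic(k), 3),
              round(fwer_analytic(k, bonferroni=True), 3),
              round(fwer_monte_carlo(k), 3))
```

With correlated outcomes the uncorrected FWER is lower than the independent-test value, while Bonferroni still bounds it at α, which is why the correction is described as conservative.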
Step 4 — Robustness checks
- Variance assumption: re-run with Welch's t-test (unequal variances) and compare. If results differ qualitatively, report both.
- Outliers: identify with the 1.5 × IQR rule; re-run with and without them; report sensitivity.
- Distribution shape: run the non-parametric Wilcoxon rank-sum as a sanity check (§8.4). Concordance with the t-test increases confidence; discordance suggests distributional issues.
- Intent-to-treat vs per-protocol: ITT analyses all randomised subjects regardless of compliance; PP analyses only compliant subjects. Report both if compliance was imperfect.
- Subgroup analyses (exploratory): split by demographic variables; report them as exploratory, apply a multiple-testing correction, and state explicitly that they were not part of the pre-registered plan.
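The first four checks in this list can be sketched together, assuming NumPy and SciPy are available. The helper names (`iqr_outlier_mask`, `robustness_report`, `itt_vs_pp`) and the compliance flags are hypothetical; the statistical calls are SciPy's `ttest_ind` (with `equal_var=False` for Welch) and `ranksums` for the Wilcoxon rank-sum test.

```python
import numpy as np
from scipy import stats


def iqr_outlier_mask(x):
    """True where a value lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)


def robustness_report(control, treatment):
    """P-values of the planned test and three sensitivity analyses."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    c_trim = control[~iqr_outlier_mask(control)]
    t_trim = treatment[~iqr_outlier_mask(treatment)]
    return {
        "student_t": float(stats.ttest_ind(treatment, control, equal_var=True).pvalue),
        "welch_t": float(stats.ttest_ind(treatment, control, equal_var=False).pvalue),
        "student_t_no_outliers": float(stats.ttest_ind(t_trim, c_trim, equal_var=True).pvalue),
        "wilcoxon_rank_sum": float(stats.ranksums(treatment, control).pvalue),
        "n_outliers_removed": int(control.size - c_trim.size + treatment.size - t_trim.size),
    }


def itt_vs_pp(outcome, assigned_treatment, complied):
    """ITT compares by assignment; PP keeps only subjects who complied."""
    outcome = np.asarray(outcome, dtype=float)
    assigned = np.asarray(assigned_treatment, dtype=bool)
    complied = np.asarray(complied, dtype=bool)
    itt = outcome[assigned].mean() - outcome[~assigned].mean()
    pp = outcome[assigned & complied].mean() - outcome[~assigned & complied].mean()
    return {"itt_diff": float(itt), "pp_diff": float(pp)}


if __name__ == "__main__":
    rng = np.random.default_rng(seed=2)  # arbitrary seed
    control = rng.normal(0.0, 1.0, size=80)
    treatment = rng.normal(0.4, 1.0, size=80)
    print(robustness_report(control, treatment))
```

If the four p-values tell the same story, the planned result is robust to these assumptions; if they diverge, the report should show all of them rather than the most favourable one.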
Step 5 — Effect-size + uncertainty communication
Report the EFFECT SIZE (Cohen's $d$), not just statistical significance. With a large enough sample, a p-value of 0.001 can accompany a very small $d$: statistically significant but practically negligible. The 95% CI on the effect size communicates uncertainty better than a single p-value.
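A minimal sketch of the effect-size computation, using only the standard library. Cohen's d here is the mean difference divided by the pooled sample SD; the interval uses a common large-sample approximation to the standard error of d, which is an assumption of this sketch (exact intervals are based on the noncentral t distribution). The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d(control, treatment):
    """Standardised mean difference using the pooled sample SD."""
    n_c, n_t = len(control), len(treatment)
    pooled_var = ((n_c - 1) * variance(control)
                  + (n_t - 1) * variance(treatment)) / (n_c + n_t - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


def cohens_d_ci(d, n_c, n_t, level=0.95):
    """Approximate CI for d from a large-sample normal approximation."""
    se = math.sqrt((n_c + n_t) / (n_c * n_t) + d ** 2 / (2 * (n_c + n_t)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


if __name__ == "__main__":
    d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
    print(round(d, 3), cohens_d_ci(d, 5, 5))
    # With 80 per group, an observed d of 0.40 still carries a wide interval
    print(cohens_d_ci(0.40, 80, 80))
```

The width of the interval, not the p-value, shows how much smaller or larger than the point estimate the true effect could plausibly be.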
Step 6 — Write-up template
A pre-registration-compatible result section follows this structure:
- Restate the pre-registered primary hypothesis and analysis.
- Report the primary result: $\Delta \pm \mathrm{SE}$ (or 95% CI), $t$-statistic, $p$-value, effect size $d$.
- Report secondary outcomes with multiple-testing correction.
- Report ALL robustness checks (Step 4) and clearly mark exploratory analyses.
- Discuss the result in terms of EFFECT SIZE (practical significance) AND statistical significance.
- Acknowledge limitations: external validity, generalisability beyond the study population.
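To keep the reported numbers in a fixed order, the primary-result sentence can be generated from the analysis output rather than typed by hand. The formatter below is a hypothetical helper, and the numbers in the usage example are placeholders chosen only to show the layout, not results from any study.

```python
def format_primary_result(diff, ci, t_stat, df, p_value, d, confirmatory=True):
    """Render one result sentence; every input comes from the analysis step."""
    label = "Confirmatory" if confirmatory else "Exploratory"
    # Report very small p-values as an inequality rather than as 0.000
    p_text = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"
    return (
        f"{label}: mean difference = {diff:.2f} "
        f"(95% CI [{ci[0]:.2f}, {ci[1]:.2f}]), "
        f"t({df}) = {t_stat:.2f}, {p_text}, Cohen's d = {d:.2f}."
    )


if __name__ == "__main__":
    # Placeholder values for illustration only
    print(format_primary_result(3.2, (0.5, 5.9), 2.34, 158, 0.0212, 0.37))
```

Passing `confirmatory=False` prefixes the sentence with "Exploratory", so the distinction required by the preregistration is visible in every reported line.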
Try it
- Defaults: effect size d = 0.40, N = 80 per group, α = 0.05, K = 1. Power(N=80, d=0.40) ≈ 0.71, which is UNDERPOWERED for the standard 0.80 target. Required N for power 0.80 is ≈ 100 per group (shown in readout). The dataset shows histograms with the treatment group shifted slightly higher; the t-test gives p < 0.05 in roughly 7 of 10 resamples, so whether a given draw is significant depends on the seed.
- Increase N to 150. Power jumps to ~0.94; the t-test now reliably detects the effect (p typically < 0.01). The 95% CI on Δ shrinks. This is the power dividend of larger samples.
- Drop d to 0.15 (small effect). Power(N=150) drops to ~0.25; most resamples now produce non-significant p-values even though the true effect is non-zero. This is the canonical "underpowered" regime, where a single significant result is unreliable and tends to overestimate the effect.
- Set K = 5 (5 simultaneous tests). The uncorrected FWER bar shoots up to ~0.23 — under H₀, 23% of studies would falsely report at least one "significant" finding. Bonferroni stays at α = 0.05 by tightening the per-test threshold to 0.01.
- Set K = 20. Uncorrected FWER nears 0.64 (at least one false positive is more likely than not); Bonferroni holds at 0.05. The cost: Bonferroni reduces power for each individual test (smaller per-test α).
- Click Resample. The data changes; if your study is borderline-powered, the p-value swings wildly from significant to non-significant across resamples. Replication is unreliable in underpowered regimes.
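The Resample experiment can be reproduced offline by simulating many studies at once and counting how often the t-test rejects. This sketch assumes NumPy and SciPy, unit-variance normal scores, and a hypothetical function name `simulated_power`; the widget's own data generator may differ.

```python
import numpy as np
from scipy import stats


def simulated_power(n_per_group, d, alpha=0.05, n_resamples=2000, seed=0):
    """Fraction of simulated studies whose t-test rejects H0 at level alpha."""
    rng = np.random.default_rng(seed)
    # One row per simulated study, one column per subject
    control = rng.normal(0.0, 1.0, size=(n_resamples, n_per_group))
    treatment = rng.normal(d, 1.0, size=(n_resamples, n_per_group))
    p_values = stats.ttest_ind(treatment, control, axis=1, equal_var=True).pvalue
    return float(np.mean(p_values < alpha)), p_values


if __name__ == "__main__":
    for n, d in ((80, 0.40), (150, 0.40), (150, 0.15)):
        power, p_values = simulated_power(n, d)
        lo, hi = np.percentile(p_values, [10, 90])
        print(f"N = {n}, d = {d}: power ~ {power:.2f}, "
              f"middle 80% of p-values: {lo:.4f} to {hi:.4f}")
```

The spread of p-values across resamples is the point of the exercise: at moderate power, identical studies of the same true effect land on both sides of α.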
A colleague reports a teaching-method RCT with N = 25 per group, p = 0.04, and claims "the new method works". What three questions would you ask before believing this claim?
What you now know
A well-designed RCT walks through power analysis → preregistration → confirmatory analysis → robustness checks → write-up. Power analysis sets the sample size; preregistration locks the analysis plan; the confirmatory analysis is the planned t-test (with multiple-testing correction if K > 1); robustness checks confirm the result under data-assumption perturbations; the write-up reports effect size, CI, and clearly marks exploratory analyses. This is honest, reproducible research.
References
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Routledge. (The reference for power analysis.)
- Schulz, K.F., Altman, D.G., Moher, D. (2010). "CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials." BMJ 340. (Standard RCT reporting framework.)
- Nosek, B.A., Ebersole, C.R., DeHaven, A.C., Mellor, D.T. (2018). "The preregistration revolution." PNAS 115(11), 2600–2606.
- Gelman, A., Loken, E. (2014). "The garden of forking paths." American Statistician 68(3), 232–236. (Why p-hacking is so easy without preregistration.)
- Bonferroni, C.E. (1936). "Teoria statistica delle classi e calcolo delle probabilità." Original Bonferroni-correction paper.