Capstone 1 — a designed RCT, end to end

Part 10 — Real-research capstones

Learning objectives

Execute a complete RCT workflow: power analysis → preregistration → data simulation → confirmatory analysis → robustness checks → write-up
Compute the sample size REQUIRED to detect a target effect at chosen α and power
Run a confirmatory t-test, report effect estimate + CI, and apply Bonferroni correction for K outcomes
Perform ROBUSTNESS checks: variance changes, outliers, distribution-shape sensitivity, intent-to-treat vs per-protocol
Write a brief preregistration-compatible result section that distinguishes confirmatory from exploratory findings

This capstone walks through a complete RCT from study design through publication, applying Parts 1–4 (estimation, hypothesis testing, CIs, regression) and Part 9 (ML for researchers). The scenario: a hypothetical educational study tests whether a new instructional method improves standardised test scores. The capstone covers six steps: power analysis, preregistration, confirmatory analysis of the collected (here simulated) data, robustness checks, effect-size communication, and write-up. The integrated widget below runs steps 1–3; the prose covers steps 4–6.

Step 1 — Power analysis BEFORE data collection

An RCT with insufficient power is wasteful: it cannot reliably detect the effect it was designed to test. Power analysis sets the required sample size:

$$N_{\text{per group}} = 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2,$$

where $d$ is the target standardised effect size (Cohen's $d = (\mu_T - \mu_C)/\sigma$ ), $\alpha$ the significance level, and $1 - \beta$ the desired power. For our hypothetical study: target $d = 0.4$ (between Cohen's "small" 0.2 and "medium" 0.5 benchmarks), $\alpha = 0.05$ two-sided, power $= 0.80$ . With $z_{0.975} \approx 1.96$ and $z_{0.80} \approx 0.84$ this gives about 98.1, rounded up to 99; the widget reports $N \approx 100$ per group as the required sample size.
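The sample-size calculation can be sketched in a few lines of Python. This is a minimal illustration under the normal approximation, not the widget's own code; the function name `n_per_group` is hypothetical, and the formula is written directly in terms of Cohen's $d$.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group N for a two-sided, two-sample comparison (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    # N = 2 * ((z_alpha + z_beta) / d)^2, rounded up to a whole participant.
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


print(n_per_group(0.4))  # 99
```

An exact calculation based on the noncentral t distribution gives a slightly larger N, which is one reason to treat the rounded figure of about 100 per group as the planning target.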

Step 2 — Preregistration (analysis lock)

BEFORE data collection, pre-register the analysis plan: primary hypothesis (expected direction $\mu_T > \mu_C$ , tested two-sided at $\alpha = 0.05$ to match the power analysis), primary outcome (standardised test score), inclusion / exclusion criteria, primary analysis (independent two-sample t-test, equal variances), secondary outcomes (with multiple-testing correction), and interim analyses and stopping rules (none planned). Once locked, post-hoc deviations must be reported as exploratory.

Pre-registration limits the scope for p-hacking, the garden of forking paths, and confirmation bias. It also makes the analysis transparent and reproducible: the analysis is fixed by the pre-registration, not chosen after seeing the data.

Step 3 — Confirmatory analysis

Run the pre-registered analysis on the collected (or simulated) data. The widget shows: histograms of control and treatment groups, sample means $\bar{x}_C, \bar{x}_T$ , t-statistic, p-value, and 95% CI for the mean difference $\Delta = \bar{x}_T - \bar{x}_C$ . If $p < 0.05$ , reject $H_0$ ; if $p \ge 0.05$ , fail to reject (but DO NOT claim "no effect" — a non-significant result is consistent with a range of effect sizes).
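A minimal sketch of this confirmatory step on simulated data follows. The seed, the group sizes, and the function name `pooled_t_test` are illustrative assumptions, and the p-value and CI use the normal approximation to the t distribution (reasonable at df ≈ 200). For a real analysis an exact t-based routine such as `scipy.stats.ttest_ind` would be the usual choice.

```python
import math
import random
from statistics import NormalDist, mean, variance


def pooled_t_test(control, treatment, alpha=0.05):
    """Equal-variance two-sample t statistic with large-sample p-value and CI."""
    n_c, n_t = len(control), len(treatment)
    delta = mean(treatment) - mean(control)
    # Pooled variance: df-weighted average of the two sample variances.
    s2 = ((n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)) / (n_c + n_t - 2)
    se = math.sqrt(s2 * (1 / n_c + 1 / n_t))
    t = delta / se
    z = NormalDist()
    # ASSUMPTION: normal approximation to the t distribution (large df only).
    p = 2 * (1 - z.cdf(abs(t)))
    half_width = z.inv_cdf(1 - alpha / 2) * se
    return {
        "delta": delta,
        "se": se,
        "t": t,
        "df": n_c + n_t - 2,
        "p": p,
        "ci": (delta - half_width, delta + half_width),
    }


rng = random.Random(1)  # hypothetical seed, for reproducibility only
control = [rng.gauss(0.0, 1.0) for _ in range(100)]    # simulated control scores
treatment = [rng.gauss(0.4, 1.0) for _ in range(100)]  # true d = 0.4
result = pooled_t_test(control, treatment)
print(result)
```

The returned dictionary holds exactly the quantities the pre-registered report needs: the mean difference, its standard error, the test statistic, the p-value, and the confidence interval.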

For multiple primary or secondary outcomes (K tests), apply Bonferroni correction: declare significance at $\alpha / K$ . The widget's Step 3 panel shows Monte Carlo family-wise error rates (FWER): for independent tests the uncorrected FWER rises with K as $1 - (1 - \alpha)^K$ (bounded above by $K\alpha$ and far above target for large K), while Bonferroni keeps it at or just below α.
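The family-wise error rate can be checked with a short Monte Carlo sketch. It assumes the K tests are independent and that every null hypothesis is true, so each p-value is uniform on (0, 1); the function name `fwer` and its defaults are hypothetical, not taken from the widget.

```python
import random


def fwer(k, alpha=0.05, n_studies=20_000, bonferroni=False, seed=0):
    """Share of simulated null studies with at least one 'significant' test."""
    rng = random.Random(seed)
    threshold = alpha / k if bonferroni else alpha
    hits = 0
    for _ in range(n_studies):
        # ASSUMPTION: under H0 the K p-values are independent Uniform(0, 1) draws.
        if any(rng.random() < threshold for _ in range(k)):
            hits += 1
    return hits / n_studies


for k in (1, 5, 20):
    exact = 1 - (1 - 0.05) ** k  # analytic uncorrected FWER under independence
    print(k, round(exact, 3), fwer(k), fwer(k, bonferroni=True))
```

With correlated outcomes the uncorrected FWER is lower than the independent-test value, and Bonferroni becomes conservative, but it still keeps the FWER at or below α.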

Step 4 — Robustness checks

Variance assumption: re-run with Welch's t-test (unequal variances) and compare. If results differ qualitatively, report both.
Outliers: identify with the 1.5 × IQR rule; re-run with and without them; report sensitivity.
Distribution shape: run the non-parametric Wilcoxon rank-sum as a sanity check (§8.4). Concordance with the t-test increases confidence; discordance suggests distributional issues.
Intent-to-treat vs per-protocol: ITT analyses all randomised subjects regardless of compliance; PP analyses only compliant subjects. Report both if compliance was imperfect.
Subgroup analyses (exploratory): split by demographic variables; report them as exploratory, apply multiple-testing corrections, and state explicitly that they were not pre-registered.
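Three of the checks above can be sketched with the standard library alone: Welch's statistic with its Welch–Satterthwaite degrees of freedom, the 1.5 × IQR outlier rule, and the ITT versus per-protocol split. All function names and the record layout `(arm, complied, score)` are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, quantiles, variance


def welch_t(control, treatment):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_c, n_t = len(control), len(treatment)
    v_c, v_t = variance(control) / n_c, variance(treatment) / n_t
    t = (mean(treatment) - mean(control)) / math.sqrt(v_c + v_t)
    df = (v_c + v_t) ** 2 / (v_c ** 2 / (n_c - 1) + v_t ** 2 / (n_t - 1))
    return t, df


def drop_outliers(values):
    """Keep values inside the 1.5 x IQR fences."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]


def split_itt_pp(records):
    """Split hypothetical (arm, complied, score) records into ITT and PP sets."""
    itt = {"C": [], "T": []}
    pp = {"C": [], "T": []}
    for arm, complied, score in records:
        itt[arm].append(score)     # ITT: everyone, as randomised
        if complied:
            pp[arm].append(score)  # PP: compliant participants only
    return itt, pp


print(welch_t([1, 2, 3], [2, 3, 4]))
print(drop_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

Each helper returns data in a form that can be fed back into the primary analysis, so the sensitivity of the result to each assumption can be reported side by side with the pre-registered estimate.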

Step 5 — Effect-size + uncertainty communication

Report the EFFECT SIZE ( $d$ ), not just statistical significance. A p-value of 0.001 with $d = 0.05$ is statistically significant but practically negligible. The 95% CI on the effect size communicates uncertainty better than a single p-value.
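As a rough sketch, Cohen's $d$ and an approximate confidence interval can be computed as follows. The standard error used here is the common large-sample approximation, which is an assumption of this example rather than something specified in the capstone; the function name `cohens_d` is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d(control, treatment, alpha=0.05):
    """Cohen's d with an approximate large-sample confidence interval."""
    n_c, n_t = len(control), len(treatment)
    s_pooled = math.sqrt(
        ((n_c - 1) * variance(control) + (n_t - 1) * variance(treatment))
        / (n_c + n_t - 2)
    )
    d = (mean(treatment) - mean(control)) / s_pooled
    # ASSUMPTION: large-sample approximation to the standard error of d.
    se = math.sqrt((n_c + n_t) / (n_c * n_t) + d ** 2 / (2 * (n_c + n_t)))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return d, (d - z * se, d + z * se)


d, ci = cohens_d([1, 2, 3], [2, 3, 4])
print(d, ci)
```

With only three observations per group the interval is very wide, which illustrates the point of reporting it: the point estimate alone says little about how well the effect is pinned down.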

Step 6 — Write-up template

A pre-registration-compatible result section follows this structure:

Restate the pre-registered primary hypothesis and analysis.
Report the primary result: $\Delta = \bar{x}_T - \bar{x}_C \pm \text{SE}$ (or 95% CI), $t$ -statistic, $p$ -value, effect size $d$ .
Report secondary outcomes with multiple-testing correction.
Report ALL robustness checks (Step 4) and clearly mark exploratory analyses.
Discuss the result in terms of EFFECT SIZE (practical significance) AND statistical significance.
Acknowledge limitations: external validity, generalisability beyond the study population.

Try it

Defaults: effect size d = 0.40, N = 80, α = 0.05, K = 1. Power(N=80, d=0.40) ≈ 0.71, which is UNDERPOWERED for the standard 0.80 target. Required N for power 0.80 is ≈ 100 per group (shown in readout). The dataset shows histograms with treatment shifted slightly higher; because power is ≈ 0.71, the t-test gives p < 0.05 in roughly 7 of 10 resamples and a non-significant p in the rest (depending on seed).
Increase N to 150. Power jumps to ~0.93; the t-test now reliably detects the effect (p typically < 0.01). The 95% CI on Δ shrinks. This is the power dividend of larger samples.
Drop d to 0.15 (small effect). Power(N=150) drops to ~0.25; roughly three in four resamples now produce non-significant p-values even though the true effect is non-zero. This is the canonical "underpowered" regime, in which the results that do reach significance tend to overestimate the effect and replicate poorly.
Set K = 5 (5 simultaneous tests). The uncorrected FWER bar shoots up to ~0.23 — under H₀, 23% of studies would falsely report at least one "significant" finding. Bonferroni stays at α = 0.05 by tightening the per-test threshold to 0.01.
Set K = 20. Uncorrected FWER nears 0.64 (at least one false positive is more likely than not); Bonferroni holds at 0.05. The cost: Bonferroni reduces power for each individual test (the per-test α shrinks to 0.0025).
Click Resample. The data changes; if your study is borderline-powered, the p-value swings wildly from significant to non-significant across resamples. Replication is unreliable in underpowered regimes.
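The power readouts in these exercises can be approximated with a short function. This is a sketch under the normal approximation for a two-sided, equal-N, two-sample comparison; the name `power_two_sample` is hypothetical, and a widget that uses the exact noncentral t distribution may show slightly different values.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for d, n in [(0.40, 80), (0.40, 150), (0.15, 150)]:
    print(d, n, round(power_two_sample(d, n), 3))
```

Setting d to 0 returns α, a quick sanity check that the function reduces to the false-positive rate when there is no effect.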

A colleague reports a teaching-method RCT with N = 25 per group, p = 0.04, and claims "the new method works". What three questions would you ask before believing this claim?

What you now know

A well-designed RCT walks through power analysis → preregistration → confirmatory analysis → robustness checks → write-up. Power analysis sets the sample size; preregistration locks the analysis plan; the confirmatory analysis is the planned t-test (with multiple-testing correction if K > 1); robustness checks confirm the result under data-assumption perturbations; the write-up reports effect size, CI, and clearly marks exploratory analyses. This is honest, reproducible research.

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Routledge. (The reference for power analysis.)
Schulz, K.F., Altman, D.G., Moher, D. (2010). "CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials." BMJ 340. (Standard RCT reporting framework.)
Nosek, B.A., Ebersole, C.R., DeHaven, A.C., Mellor, D.T. (2018). "The preregistration revolution." PNAS 115(11), 2600–2606.
Gelman, A., Loken, E. (2014). "The statistical crisis in science." American Scientist 102(6), 460–465. (Published version of the "garden of forking paths" argument: why p-hacking is so easy without preregistration.)
Bonferroni, C.E. (1936). "Teoria statistica delle classi e calcolo delle probabilità." Original Bonferroni-correction paper.