Capstone 1 — a designed RCT, end to end

Part 10 — Real-research capstones

Learning objectives

  • Execute a complete RCT workflow: power analysis → preregistration → data simulation → confirmatory analysis → robustness checks → write-up
  • Compute the sample size REQUIRED to detect a target effect at chosen α and power
  • Run a confirmatory t-test, report effect estimate + CI, and apply Bonferroni correction for K outcomes
  • Perform ROBUSTNESS checks: variance changes, outliers, distribution-shape sensitivity, intent-to-treat vs per-protocol
  • Write a brief preregistration-compatible result section that distinguishes confirmatory from exploratory findings

This capstone walks through a complete RCT from study design through write-up, applying Parts 1–4 (estimation, hypothesis testing, CIs, regression) and Part 9 (ML for researchers). The scenario: a hypothetical educational study tests whether a new instructional method improves standardised test scores. The capstone covers six steps: power analysis, preregistration, confirmatory analysis on simulated data, robustness checks, effect-size communication, and write-up. The integrated widget below runs steps 1–3; the prose covers steps 4–6.

Step 1 — Power analysis BEFORE data collection

An RCT with insufficient power is wasteful: it cannot reliably detect the effect it was designed to test. Power analysis sets the required sample size:

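$$N_{\text{per group}} = 2\left(\frac{(z_{1-\alpha/2} + z_{1-\beta})\,\sigma}{\mu_T - \mu_C}\right)^2 = 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2,$$

where $d$ is the target effect size (Cohen's $d = (\mu_T - \mu_C)/\sigma$, the mean difference in SD units), $\alpha$ the significance level, and $1 - \beta$ the desired power. Because $d$ is already standardised, $\sigma$ appears only when the formula is written in terms of the raw difference $\mu_T - \mu_C$. For our hypothetical study: target $d = 0.4$ (between Cohen's conventional small, 0.2, and medium, 0.5), $\alpha = 0.05$ two-sided, power $= 0.80$. The formula gives $2 \times (1.960 + 0.842)^2 / 0.4^2 \approx 98.1$; this normal approximation slightly understates what a t-test needs, and the widget reports $N \approx 100$ per group as the required sample size.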
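The sample-size arithmetic can be sketched in a few lines of Python. This is a minimal illustration under the normal approximation, with the effect size already expressed in SD units; the function names `required_n_per_group` and `approx_power` are made up for this example and are not the widget's code.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def required_n_per_group(d, alpha=0.05, power=0.80):
    """Per-group N for a two-sided, two-sample comparison (normal approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


def approx_power(n_per_group, d, alpha=0.05):
    """Approximate two-sided power for a standardised effect d with n per group."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability of landing in either rejection tail.
    return _STD_NORMAL.cdf(shift - z_alpha) + _STD_NORMAL.cdf(-shift - z_alpha)


print(required_n_per_group(0.4))
print(round(approx_power(80, 0.4), 2))
```

Because the normal approximation ignores the extra uncertainty from estimating the variance, a t-based calculation (for example in a dedicated power-analysis package) asks for slightly more subjects than this sketch.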

Step 2 — Preregistration (analysis lock)

BEFORE data collection, pre-register the analysis plan: primary hypothesis ($H_0: \mu_T = \mu_C$, tested two-sided to match the two-sided $\alpha$ in Step 1, with predicted direction $\mu_T > \mu_C$), primary outcome (standardised test score), inclusion / exclusion criteria, primary analysis (independent two-sample t-test, equal variances), secondary outcomes (with multiple-testing correction), interim analyses (none) and stopping rules. Once locked, post-hoc deviations must be reported as exploratory.

Pre-registration guards against p-hacking, the garden of forking paths, and confirmation bias. It also makes the analysis transparent and checkable: the analysis is fixed by the pre-registration, not chosen after seeing the data.
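One way to make the "lock" concrete is to store the plan as an immutable record and keep a hash of it with the registration. The sketch below is only an illustration: the `AnalysisPlan` class, its field names, and the secondary outcomes are hypothetical, and real registries have their own forms and timestamps.

```python
import hashlib
import json
from dataclasses import FrozenInstanceError, asdict, dataclass


@dataclass(frozen=True)
class AnalysisPlan:
    """Hypothetical record of a locked analysis plan (field names are illustrative)."""
    primary_outcome: str
    primary_test: str
    alpha: float
    n_per_group: int
    secondary_outcomes: tuple = ()
    interim_analyses: int = 0

    def fingerprint(self):
        """SHA-256 of the plan's canonical JSON form; any change alters the hash."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


plan = AnalysisPlan(
    primary_outcome="standardised test score",
    primary_test="independent two-sample t-test, equal variances",
    alpha=0.05,
    n_per_group=100,
    secondary_outcomes=("attendance", "homework completion"),  # made-up examples
)
locked_hash = plan.fingerprint()  # record this alongside the registration

try:
    plan.alpha = 0.10  # a post-hoc change to the locked plan
except FrozenInstanceError:
    print("plan is locked; report any deviation as exploratory")
```

The frozen dataclass only prevents accidental edits inside one program; the hash is what lets a reader verify later that the analysed plan is the registered one.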

Step 3 — Confirmatory analysis

Run the pre-registered analysis on the collected (or simulated) data. The widget shows: histograms of control and treatment groups, sample means $\bar{x}_C$ and $\bar{x}_T$, t-statistic, p-value, and 95% CI for the mean difference $\Delta = \bar{x}_T - \bar{x}_C$. If $p < 0.05$, reject $H_0$; if $p \ge 0.05$, fail to reject (but DO NOT claim "no effect": a non-significant result is consistent with a range of effect sizes).
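A rough sketch of this analysis on simulated data follows. It computes the equal-variance t statistic exactly but, to stay within the standard library, takes the p-value and CI from a normal reference distribution; that is an assumption that is reasonable near 100 per group and too optimistic for small samples, where a t distribution (for example via `scipy.stats.ttest_ind`) should be used. The function name `pooled_t_test` and the seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, variance


def pooled_t_test(control, treatment, confidence=0.95):
    """Equal-variance two-sample t statistic with normal-approximation p-value and CI."""
    n_c, n_t = len(control), len(treatment)
    delta = fmean(treatment) - fmean(control)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_c - 1) * variance(control)
                  + (n_t - 1) * variance(treatment)) / (n_c + n_t - 2)
    se = sqrt(pooled_var * (1 / n_c + 1 / n_t))
    t_stat = delta / se
    std_normal = NormalDist()
    # ASSUMPTION: normal reference distribution instead of t with n_c + n_t - 2 df.
    p_two_sided = 2 * (1 - std_normal.cdf(abs(t_stat)))
    z_crit = std_normal.inv_cdf(0.5 + confidence / 2)
    return {"delta": delta, "se": se, "t": t_stat, "p": p_two_sided,
            "ci": (delta - z_crit * se, delta + z_crit * se)}


# Simulated data: scores in SD units, true effect d = 0.4, 100 per group.
rng = random.Random(2024)  # arbitrary seed
control = [rng.gauss(0.0, 1.0) for _ in range(100)]
treatment = [rng.gauss(0.4, 1.0) for _ in range(100)]
print(pooled_t_test(control, treatment))
```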

For multiple primary or secondary outcomes (K tests), apply Bonferroni correction: declare significance at $\alpha / K$. The widget's Step 3 panel shows Monte Carlo family-wise error rates (FWER): for independent tests the uncorrected rate is $1 - (1 - \alpha)^K$, which is close to $K\alpha$ for small K and far above the target, while Bonferroni keeps it at or just below $\alpha$.
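The family-wise error rates can be reproduced with a short Monte Carlo, sketched here under the assumption that the K tests are independent and every null hypothesis is true (so each p-value is uniform on [0, 1]). The function names and the seed are illustrative, not the widget's implementation.

```python
import random


def analytic_fwer(k, alpha=0.05):
    """FWER for k independent tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** k


def simulated_fwer(k, alpha=0.05, bonferroni=False, n_studies=20_000, seed=0):
    """Share of simulated null studies with at least one 'significant' test."""
    rng = random.Random(seed)
    threshold = alpha / k if bonferroni else alpha
    false_alarms = 0
    for _ in range(n_studies):
        # Under H0 each p-value is uniform on [0, 1].
        if any(rng.random() < threshold for _ in range(k)):
            false_alarms += 1
    return false_alarms / n_studies


for k in (1, 5, 20):
    print(k, round(analytic_fwer(k), 3),
          simulated_fwer(k), simulated_fwer(k, bonferroni=True))
```

With correlated outcomes the uncorrected rate is lower than the independent-test formula suggests, and Bonferroni becomes conservative; it still controls the FWER.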

Step 4 — Robustness checks

  • Variance assumption: re-run with Welch's t-test (unequal variances) and compare. If results differ qualitatively, report both.
  • Outliers: identify with the 1.5 × IQR rule; re-run with and without them; report sensitivity.
  • Distribution shape: run the non-parametric Wilcoxon rank-sum as a sanity check (§8.4). Concordance with the t-test increases confidence; discordance suggests distributional issues.
  • Intent-to-treat vs per-protocol: ITT analyses all randomised subjects regardless of compliance; PP analyses only compliant subjects. Report both if compliance was imperfect.
  • Subgroup analyses (exploratory): split by demographic variables; report them as exploratory, apply multiple-testing corrections, and state explicitly that they were not part of the pre-registered plan.
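Three of the checks above reduce to small helper functions, sketched here with the standard library only. The names `welch_t`, `iqr_outliers`, and `split_itt_pp`, and the record keys `arm`, `complied`, and `score`, are hypothetical choices for this example; p-values and the rank-sum test are left to a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`, `scipy.stats.mannwhitneyu`).

```python
from math import sqrt
from statistics import fmean, quantiles, variance


def welch_t(control, treatment):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a = variance(treatment) / len(treatment)
    b = variance(control) / len(control)
    t_stat = (fmean(treatment) - fmean(control)) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (len(treatment) - 1) + b ** 2 / (len(control) - 1))
    return t_stat, df


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def split_itt_pp(records):
    """Group scores by arm for ITT (all randomised) and PP (compliant only)."""
    itt = {"control": [], "treatment": []}
    pp = {"control": [], "treatment": []}
    for r in records:
        itt[r["arm"]].append(r["score"])      # ITT: everyone, as randomised
        if r["complied"]:
            pp[r["arm"]].append(r["score"])   # PP: compliant subjects only
    return itt, pp
```

Quartile conventions differ between tools, so the set of flagged outliers can change slightly with the quantile method; it is worth stating which one was used.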

Step 5 — Effect-size + uncertainty communication

Report the EFFECT SIZE ($d$), not just statistical significance. A p-value of 0.001 with $d = 0.05$ is statistically significant but practically negligible. The 95% CI on the effect size communicates uncertainty better than a single p-value.
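As a sketch of how the effect size and its interval might be computed, the code below estimates Cohen's d from the pooled SD and attaches a CI based on a common large-sample approximation to the standard error of d. That approximation is an assumption of this example; exact intervals use the noncentral t distribution. The second helper, `n_per_group_for_p` (a hypothetical name), shows how large a study must be before a tiny observed effect yields a small p-value.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def cohens_d_with_ci(control, treatment, confidence=0.95):
    """Cohen's d (pooled SD) with a large-sample normal-approximation CI."""
    n_c, n_t = len(control), len(treatment)
    pooled_sd = sqrt(((n_c - 1) * variance(control)
                      + (n_t - 1) * variance(treatment)) / (n_c + n_t - 2))
    d = (fmean(treatment) - fmean(control)) / pooled_sd
    # ASSUMPTION: common large-sample approximation to the standard error of d.
    se_d = sqrt((n_c + n_t) / (n_c * n_t) + d ** 2 / (2 * (n_c + n_t)))
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d, (d - z_crit * se_d, d + z_crit * se_d)


def n_per_group_for_p(d, p_two_sided):
    """Per-group N at which an observed effect d gives the stated two-sided p."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return 2 * (z / d) ** 2


print(round(n_per_group_for_p(0.05, 0.001)))
```

Under the normal approximation, an observed d = 0.05 reaches p = 0.001 only with several thousand subjects per group, which is why significance alone says little about practical importance.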

Step 6 — Write-up template

A pre-registration-compatible result section follows this structure:

  • Restate the pre-registered primary hypothesis and analysis.
  • Report the primary result: $\Delta = \bar{x}_T - \bar{x}_C \pm \text{SE}$ (or 95% CI), $t$-statistic, $p$-value, effect size $d$.
  • Report secondary outcomes with multiple-testing correction.
  • Report ALL robustness checks (Step 4) and clearly mark exploratory analyses.
  • Discuss the result in terms of EFFECT SIZE (practical significance) AND statistical significance.
  • Acknowledge limitations: external validity, generalisability beyond the study population.
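The primary-result line of such a section can be generated from the analysis output so that the reported numbers cannot drift from the computed ones. The helper below is a hypothetical sketch; its wording and the example numbers are made up purely to show the format.

```python
def format_primary_result(delta, ci, t_stat, df, p_value, d):
    """One-sentence summary of the confirmatory result (wording is illustrative)."""
    p_text = "p < 0.001" if p_value < 0.001 else f"p = {p_value:.3f}"
    return (f"Confirmatory (pre-registered): mean difference = {delta:.2f} "
            f"(95% CI {ci[0]:.2f} to {ci[1]:.2f}), t({df}) = {t_stat:.2f}, "
            f"{p_text}, d = {d:.2f}.")


# Made-up numbers, purely to show the output format.
print(format_primary_result(delta=4.0, ci=(1.21, 6.79), t_stat=2.83, df=198,
                            p_value=0.005, d=0.40))
```

Exploratory results would get a separate, clearly labelled sentence rather than reusing this template.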

[Interactive figure: RCT Capstone Demo. Enable JavaScript to interact.]

Try it

  • Defaults: effect size d = 0.40, N = 80, α = 0.05, K = 1. Power(N=80, d=0.40) ≈ 0.71, which is UNDERPOWERED for the standard 0.80 target. Required N for power 0.80 is ≈ 100 per group (shown in readout). The dataset shows histograms with treatment shifted slightly higher; at this power the t-test gives p < 0.05 for roughly 7 in 10 seeds and misses for the rest.
  • Increase N to 150. Power jumps to ~0.94; the t-test now reliably detects the effect (p typically < 0.01). The 95% CI on Δ shrinks. This is the power dividend of larger samples.
  • Drop d to 0.15 (small effect). Power(N=150) drops to ~0.25; most resamples now produce non-significant p-values even though the true effect is non-zero. This is the canonical "underpowered" regime, where the significant results that do appear tend to overestimate the effect and replicate poorly.
  • Set K = 5 (5 simultaneous tests). The uncorrected FWER bar shoots up to ~0.23 — under H₀, 23% of studies would falsely report at least one "significant" finding. Bonferroni stays at α = 0.05 by tightening the per-test threshold to 0.01.
  • Set K = 20. Uncorrected FWER nears 0.64 (under H₀, roughly two in three studies report at least one false positive); Bonferroni holds at 0.05. The cost: Bonferroni reduces power for each individual test (smaller per-test α).
  • Click Resample. The data changes; if your study is borderline-powered, the p-value swings wildly from significant to non-significant across resamples. Replication is unreliable in underpowered regimes.
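The resampling behaviour can be imitated offline by simulating many studies at a given N and d and counting how often p falls below α; that share is an empirical estimate of power. This is a sketch, not the widget's code: it assumes unit-variance normal scores, equal group sizes, and a normal approximation to the t reference distribution, and the function names and seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def _mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def one_study_p(n_per_group, d, rng):
    """Simulate one two-arm study (unit SD) and return a two-sided p-value."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treatment = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
    (m_c, v_c), (m_t, v_t) = _mean_var(control), _mean_var(treatment)
    se = sqrt((v_c + v_t) / n_per_group)  # pooled SE for equal group sizes
    # ASSUMPTION: normal approximation to the t reference distribution.
    return 2 * (1 - _STD_NORMAL.cdf(abs((m_t - m_c) / se)))


def share_significant(n_per_group, d, alpha=0.05, n_resamples=2000, seed=1):
    """Fraction of simulated studies with p < alpha (empirical power)."""
    rng = random.Random(seed)
    hits = sum(one_study_p(n_per_group, d, rng) < alpha for _ in range(n_resamples))
    return hits / n_resamples


for n, d in ((80, 0.40), (150, 0.40), (150, 0.15)):
    print(n, d, share_significant(n, d))
```

Each printed share approximates the power at that setting, up to Monte Carlo noise of about one percentage point with 2000 resamples.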

A colleague reports a teaching-method RCT with N = 25 per group, p = 0.04, and claims "the new method works". What three questions would you ask before believing this claim?

What you now know

A well-designed RCT walks through power analysis → preregistration → confirmatory analysis → robustness checks → write-up. Power analysis sets the sample size; preregistration locks the analysis plan; the confirmatory analysis is the planned t-test (with multiple-testing correction if K > 1); robustness checks confirm the result under data-assumption perturbations; the write-up reports effect size, CI, and clearly marks exploratory analyses. This is honest, reproducible research.

References

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Routledge. (The reference for power analysis.)
  • Schulz, K.F., Altman, D.G., Moher, D. (2010). "CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials." BMJ 340. (Standard RCT reporting framework.)
  • Nosek, B.A., Ebersole, C.R., DeHaven, A.C., Mellor, D.T. (2018). "The preregistration revolution." PNAS 115(11), 2600–2606.
  • Gelman, A., Loken, E. (2014). "The statistical crisis in science." American Scientist 102(6), 460–465. (Discusses the "garden of forking paths": why p-hacking is so easy without preregistration.)
  • Bonferroni, C.E. (1936). "Teoria statistica delle classi e calcolo delle probabilità." Original Bonferroni-correction paper.
