Master research-analyst workflow card
Learning objectives
- Walk a complete research project from question to publication using the textbook's tools in the right order
- Recognise the decision points where pre-registration, sensitivity analysis, and reporting standards apply
- Apply this workflow as a self-check on any analytical project before committing code or publishing results
This is the closing reference card for the Statistics & Data Science for Researchers textbook. Walk any research project (RCT, observational, meta-analytic, Bayesian, ML, reproducibility audit) through the stages below. Each stage is a checklist; the section references in parentheses (§) point to the textbook sections that contain the relevant tools.
Stage 1 — Frame the research question
- What is the QUANTITY OF INTEREST? Mean, proportion, regression coefficient, causal effect, predictive accuracy, posterior probability of a threshold?
- What is the TARGET POPULATION? Inferences only generalise to this population; document its boundaries.
- Is the question CAUSAL or merely DESCRIPTIVE? If causal, build a DAG (§6.1-6.3); without one, regression coefficients are measures of association at best.
- Pre-register the analysis (§10.6, multiverse / reproducibility), at minimum the headline analysis. Use PROSPERO for systematic reviews and meta-analyses; OSF or AsPredicted for RCTs and observational studies.
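The DAG step above can be made concrete before any data are collected. The sketch below encodes a hypothetical DAG as a plain dictionary and lists the variables that cause the treatment and also reach the outcome along a path that bypasses the treatment. The variable names and both helper functions are illustrative assumptions. The screen is a simplification of the back-door criterion and does not replace the identification analysis in §6.1-6.3.

```python
def descendants(dag, node, blocked=frozenset()):
    """Return nodes reachable from `node` along directed edges, never entering `blocked`."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in dag.get(current, ()):
            if child not in seen and child not in blocked:
                seen.add(child)
                stack.append(child)
    return seen


def candidate_confounders(dag, treatment, outcome):
    """List common causes: nodes that cause the treatment AND reach the outcome
    by a directed path that does not pass through the treatment."""
    nodes = set(dag) | {child for children in dag.values() for child in children}
    found = set()
    for node in nodes - {treatment, outcome}:
        causes_treatment = treatment in descendants(dag, node)
        reaches_outcome_otherwise = outcome in descendants(dag, node, blocked={treatment})
        if causes_treatment and reaches_outcome_otherwise:
            found.add(node)
    return found


# Hypothetical DAG: each key points to the variables it directly causes.
dag = {
    "age": ["exercise", "blood_pressure"],
    "encouragement": ["exercise"],  # affects the outcome only via the treatment
    "exercise": ["weight", "blood_pressure"],
    "weight": ["blood_pressure"],  # a mediator, not a confounder
}

candidates = candidate_confounders(dag, "exercise", "blood_pressure")
print(sorted(candidates))
```

In this toy graph only `age` is flagged: `encouragement` influences the outcome solely through the treatment, and `weight` lies downstream of the treatment.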
Stage 2 — Design or select the data
- RCT: power analysis (§2.2), randomisation scheme, blocking, blinding.
- Observational: select cohort with documented inclusion criteria, identify confounders (§6.3), plan adjustment strategy.
- Meta-analysis: define inclusion/exclusion criteria (§10.3); register search strategy.
- Reanalysis: archive original dataset with version-control hashes; document provenance.
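For the RCT power analysis listed above, a quick first-pass calculation can be done with the standard library alone. The sketch below uses the normal approximation for a two-sided, two-arm comparison of means with equal allocation; the function name and default values are assumptions made for illustration. An exact t-based calculation (§2.2) gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided, two-sample comparison of means.

    `effect_size` is the standardised mean difference (Cohen's d).
    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    if effect_size == 0:
        raise ValueError("effect_size must be non-zero")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# A medium standardised effect (d = 0.5) at the conventional alpha and power.
print(n_per_arm(0.5))
```

Halving the assumed effect size roughly quadruples the required sample, which is why the assumed effect should be justified in the pre-registration rather than chosen to fit the budget.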
Stage 3 — Choose the model
- Continuous Y, Normal-residual: OLS (§§4.1-4.7).
- Binary Y: logistic regression (§5.2) + calibration checks (§5.4).
- Count Y: Poisson / negative binomial (NB) (§5.3) with an overdispersion diagnostic.
- Survival times: Cox PH (§5.6); check PH assumption via Schoenfeld residuals.
- Clustered data: random-effects model (§5.5) with appropriate level structure.
- Non-linear mean structure: GAM (§5.6) with smoothing parameter chosen by CV.
- Heavy-tailed errors: quantile or robust regression (§5.6).
- Causal inference: choose IV / RDD / DiD / propensity-score per identification strategy (§6.4-6.6).
- Sparse data / hierarchical: Bayesian model (§§7.1-7.4) with informative priors.
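The dispersion diagnostic for count outcomes can be illustrated in its simplest form. The sketch below computes the variance-to-mean ratio of raw counts, which is the relevant check only for an intercept-only model; with covariates, the usual analogue is the Pearson chi-square statistic divided by the residual degrees of freedom. The function name and the example counts are made up for illustration.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio of raw counts.

    A Poisson variable has variance equal to its mean, so a ratio well
    above 1 suggests overdispersion and points towards a negative binomial model.
    """
    average = mean(counts)
    if average == 0:
        raise ValueError("all counts are zero; the ratio is undefined")
    return variance(counts) / average


# Hypothetical event counts with a few large values.
counts = [0, 1, 0, 2, 9, 0, 1, 14, 0, 3]
ratio = dispersion_index(counts)
print(f"dispersion index: {ratio:.2f}")
```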
Stage 4 — Estimate and assess
- Fit the model; report the point estimate with a 95% confidence interval (CI) or credible interval (CrI).
- Diagnostics (§5.4): residual plots, leverage, Cook's D; flag observations driving the result.
- Calibration (§5.4 for GLMs, §10.5 for ML): predicted vs observed across the predicted range.
- For ML deployment: train/calibrate/test/audit splits (§10.5); fairness audit if relevant.
- Robustness: vary one analytic choice at a time (§10.6). For high-stakes claims: full multiverse.
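The calibration check above (predicted vs observed across the predicted range) can be sketched as a binned table. This is a minimal illustration using equal-width bins; the function name, bin count, and example data are assumptions, and the book's own procedure in §5.4 and §10.5 may differ (for example, quantile bins or a smoothed curve).

```python
def calibration_table(predicted, observed, n_bins=4):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability with the observed event rate in each bin.

    Returns a list of (bin_index, n, mean_predicted, observed_rate) tuples,
    skipping empty bins.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    bins = [[] for _ in range(n_bins)]
    for probability, outcome in zip(predicted, observed):
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(probability * n_bins), n_bins - 1)
        bins[index].append((probability, outcome))
    table = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        table.append((index, len(members), mean_predicted, observed_rate))
    return table


# Hypothetical predicted probabilities and binary outcomes.
predicted = [0.05, 0.12, 0.33, 0.37, 0.62, 0.68, 0.91, 0.97]
observed = [0, 0, 0, 1, 1, 0, 1, 1]

table = calibration_table(predicted, observed)
for index, n, mean_predicted, observed_rate in table:
    print(f"bin {index}: n={n}, predicted={mean_predicted:.2f}, observed={observed_rate:.2f}")
```

A well-calibrated model shows the two columns tracking each other across bins; with real data each bin needs enough observations for the observed rate to be stable.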
Stage 5 — Communicate
- Report the QUANTITY ON THE RELEVANT SCALE: odds ratios for logistic regression, incidence rate ratios (IRRs) for Poisson regression, predictive probabilities for ML deployment. Don't mix log-odds with probabilities.
- Report what is UNKNOWN explicitly: cite priors used, model assumptions, missing-data treatment, exclusion decisions.
- For ML: deployment context, distribution-shift considerations, fairness audit results, calibration curve.
- For causal claims: ALWAYS report sensitivity-analysis bounds for unmeasured confounding (§6.6).
- Deviations from pre-registration: state each one explicitly in the report ("this deviates from the pre-registration because…").
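Reporting on the relevant scale usually means transforming the fitted coefficient and its interval together. The sketch below converts a logistic-regression coefficient (log-odds) and its standard error into an odds ratio with a Wald interval; the function name and the example numbers are hypothetical, and a profile-likelihood or posterior interval may be preferable in small samples.

```python
from math import exp, log
from statistics import NormalDist


def odds_ratio_with_ci(log_odds, std_error, level=0.95):
    """Convert a log-odds coefficient and its standard error to an odds ratio
    with a Wald confidence interval.

    The interval is built on the log-odds scale and then exponentiated,
    so it is asymmetric around the odds ratio.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = exp(log_odds - z * std_error)
    upper = exp(log_odds + z * std_error)
    return exp(log_odds), lower, upper


# Hypothetical coefficient: log-odds of log(2) with standard error 0.2.
odds_ratio, lower, upper = odds_ratio_with_ci(log(2), 0.2)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

The same pattern applies to Poisson coefficients and IRRs: build the interval on the log scale, then exponentiate both ends.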
Stage 6 — Archive and replicate
- Archive analytic code in a public repo (GitHub, OSF) with versioned dependencies.
- Archive de-identified data where ethically permissible.
- Compute environment: containerise (Docker, Apptainer) so the analysis still runs a year or more later.
- Encourage independent replication; treat replicators as scientific peers and cite their work.
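One way to make an archive verifiable is to store a checksum manifest alongside the code and data, so a later reader can confirm the files are unchanged. The sketch below builds a SHA-256 manifest for a directory using only the standard library; the function names and the directory layout are assumptions for illustration, not a prescribed archive format.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to `directory`) to its SHA-256 digest."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Calling `build_manifest("data/")` on a hypothetical data directory and committing the result next to the analysis code lets a replicator recompute the digests and compare them before rerunning anything.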
The two big traps to avoid
- Garden of forking paths (§10.6): trying many pipelines and reporting only the most significant result. Cure: pre-register; report the multiverse.
- Mistaking association for causation (§4.8 + §6): regression coefficients DO NOT have a causal interpretation unless the DAG identifies the causal effect. Cure: write the DAG; defend it; sensitivity-bound it.
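The cost of the garden of forking paths can be shown with one line of arithmetic. The sketch below computes the probability of at least one false positive when several analyses of a true null effect are each tested at level alpha. It assumes the tests are independent, which real pipelines run on the same data are not, so treat the numbers as an illustration of the direction and rough size of the problem rather than an exact rate. The function name is an assumption.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of `n_tests` independent tests of a true
    null hypothesis is significant at level `alpha`: 1 - (1 - alpha)^n."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20):
    print(f"{n_tests:>2} pipelines: {chance_of_false_positive(n_tests):.3f}")
```

Under these assumptions, twenty pipelines give roughly a 64% chance of at least one nominally significant result when no effect exists.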
Closing thought
Modern empirical research is a CRAFT. The tools in this textbook are necessary but not sufficient — you must integrate them with domain knowledge, ethical reasoning, and a commitment to honesty about what you don't know. The replication crisis (§10.6) is largely a consequence of treating statistical methods as black boxes rather than as instruments of careful inference. Use them as instruments.
You have completed the Statistics and Data Science for Researchers curriculum. Apply this material to projects you care about; teach it to others; revisit it when a new method emerges. The field will continue to evolve — Bayesian methods will become more accessible, causal inference will continue to mature, ML and statistics will increasingly converge. The fundamentals here will remain.
References — the close-of-textbook reading list
- Wasserman, L. (2004). All of Statistics. Springer. (The compact graduate reference.)
- Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. (ESL, the unified ML / statistics reference.)
- Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed. Chapman & Hall. (BDA — modern Bayesian practice.)
- Pearl, J., Glymour, M., Jewell, N.P. (2016). Causal Inference in Statistics: A Primer. Wiley. (Causal inference primer.)
- Imbens, G.W., Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.
- Barocas, S., Hardt, M., Narayanan, A. (2019). Fairness and Machine Learning. Online: fairmlbook.org. (Algorithmic fairness reference.)
- Wickham, H., Grolemund, G. (2017). R for Data Science. O'Reilly. (Computational practice.)
- Murphy, K.P. (2022/2023). Probabilistic Machine Learning: An Introduction / Advanced Topics. MIT Press. (Modern ML + Bayesian reference.)
- Hernán, M.A., Robins, J.M. (2020). Causal Inference: What If. Online at HSPH. (Practical causal inference for epidemiology + social science.)
- VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly.