Reporting an ML result so the reader can trust it
Learning objectives
- List the SIX MANDATORY components of an ML report (data, model, evaluation, fairness, deployment, ethics)
- Recognise MODEL CARDS (Mitchell et al. 2019) and DATASHEETS (Gebru et al. 2018) as standard formats
- Document data sources, preprocessing, splits, and treatments to ensure reproducibility
- Report per-group performance + uncertainty quantification (CIs, calibration)
- Document limitations, scope of validity, and intended uses
An ML model is only useful if reviewers can EVALUATE it. Modern responsible-AI practice has converged on standardised reporting templates (MODEL CARDS, DATASHEETS) that document everything needed for an informed assessment. §9.7 develops the modern reporting checklist and gives the practical workflow for trustworthy ML deployment.
The six mandatory sections
- Data: sources, time range, demographic composition, train/val/test split sizes, class balance, missing-data treatment, pre-processing pipeline (fitted INSIDE each CV fold, on the training portion only, so that no information leaks from held-out data).
- Model: model class, hyperparameters, hyperparameter selection procedure, random seeds, reproducibility instructions.
- Evaluation: discrimination metrics (AUC, accuracy, F1, etc.), calibration (reliability diagram, Brier, ECE), confidence intervals or bootstrap SEs, comparison to baselines.
- Fairness: per-group accuracy + calibration; chosen fairness criterion + justification; mitigation applied.
- Deployment: out-of-distribution failure modes, monitoring plan, intended uses, scope of validity.
- Ethics: stakeholder consultation, risks of automation, mitigation strategies.
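The leakage point in the Data item can be made concrete with a minimal sketch. It assumes a single numeric feature and plain standardisation as the pre-processing step; the function names (`kfold_indices`, `fit_scaler`, `leakage_free_folds`) are illustrative and do not come from any particular library. The scaling statistics are learned from the training folds only and then applied unchanged to the held-out fold.

```python
import random
import statistics


def kfold_indices(n_samples, n_folds, seed):
    """Shuffle indices once with a fixed seed and deal them into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_scaler(values):
    """Learn mean and standard deviation from the given (training) values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against a constant feature
    return mean, std


def standardise(values, mean, std):
    """Apply previously fitted statistics to any set of values."""
    return [(v - mean) / std for v in values]


def leakage_free_folds(feature, n_folds=5, seed=0):
    """Yield (train_scaled, test_scaled) with the scaler refitted for every fold."""
    folds = kfold_indices(len(feature), n_folds, seed)
    for k, test_idx in enumerate(folds):
        train = [feature[i] for j, fold in enumerate(folds) if j != k for i in fold]
        test = [feature[i] for i in test_idx]
        mean, std = fit_scaler(train)  # statistics come from the training folds only
        yield standardise(train, mean, std), standardise(test, mean, std)
```

Fitting the scaler once on the full dataset before splitting would let held-out values influence the training features, which is the leak the report should rule out explicitly.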
Model Cards (Mitchell et al. 2019)
Google's Model Card template provides a standard structure:
- Model details (type, version, owners)
- Intended uses + out-of-scope uses
- Training data + evaluation data (with demographics)
- Per-subgroup performance metrics + uncertainty
- Ethical considerations + recommendations + caveats
The template has been adopted by the Hugging Face Hub, Microsoft Azure ML, and many corporate ML platforms. A model card can be published as a markdown document that accompanies the model.
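As a rough sketch of what a markdown model card could look like in code, the snippet below renders a nested dictionary into markdown. The section names follow the structure listed above; the function `render_model_card`, the dictionary layout, and the model name are assumptions made for illustration, and every value is a placeholder, not a real result.

```python
def render_model_card(card):
    """Render a nested dict of model-card sections as a markdown string."""
    lines = [f"# Model card: {card['name']}", ""]
    for section, entries in card["sections"].items():
        lines.append(f"## {section}")
        for key, value in entries.items():
            lines.append(f"- **{key}**: {value}")
        lines.append("")  # blank line between sections
    return "\n".join(lines)


# Placeholder values only; a real card is filled in by the model owners.
example_card = {
    "name": "example-classifier",
    "sections": {
        "Model details": {"Type": "TODO", "Version": "TODO", "Owners": "TODO"},
        "Intended uses": {"In scope": "TODO", "Out of scope": "TODO"},
        "Training and evaluation data": {"Sources": "TODO", "Demographics": "TODO"},
        "Per-subgroup performance": {"Metric with uncertainty": "TODO"},
        "Ethical considerations": {"Recommendations and caveats": "TODO"},
    },
}

markdown = render_model_card(example_card)
```

Keeping the card as structured data and generating the markdown from it makes it easier to check that no section was silently dropped between model versions.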
Datasheets (Gebru et al. 2018)
The complementary template for DATASETS:
- Motivation: why was the dataset created?
- Composition: what does the data contain? Distribution across subgroups?
- Collection: how was it gathered? Sampling biases?
- Pre-processing: cleaning, labeling, transformations
- Uses: intended + unintended use cases
- Distribution: how is the dataset shared? Licensing?
Datasheets + Model Cards together provide end-to-end documentation: from raw data to deployed model.
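One lightweight way to keep a datasheet honest is to store one answer per section and flag the sections still left blank. The sketch below assumes free-text answers and uses the six section names listed above; the `Datasheet` class and `unanswered_sections` helper are hypothetical, not part of any published tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """One free-text answer per section; an empty string means unanswered."""
    motivation: str = ""
    composition: str = ""
    collection: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""


def unanswered_sections(sheet):
    """Return the names of sections whose answer is empty or whitespace."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


# Placeholder answers for illustration only.
draft = Datasheet(
    motivation="Placeholder: why the dataset was created.",
    collection="Placeholder: how records were sampled.",
)
missing = unanswered_sections(draft)
```

Here `missing` lists the four sections the draft has not yet addressed, which can be used to block release of the dataset until the list is empty.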
Reporting reproducibility
The reproducibility crisis extends to ML: many published results cannot be reproduced because the authors did not specify random seeds, the hyperparameter selection procedure, or the full data pre-processing. Modern requirements:
- Random seed for every random component
- Full hyperparameter values + selection procedure
- Software/library versions
- Computational environment (GPU/CPU, memory)
- Wall-clock training time
Papers with Code, OpenReview-hosted venues, and many journals now expect or encourage these details. The Reproducibility Checklist (NeurIPS, ICML; see Pineau et al. 2021) is a standard tool.
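Most of the items above can be captured automatically at training time. The following is a minimal sketch using only the Python standard library; `build_manifest`, the hyperparameter names, and the stand-in training function are assumptions for illustration. It seeds only Python's own `random` module, so any other library with its own random state (NumPy, PyTorch, and so on) needs separate seeding. GPU and memory details are not covered here and would have to be added by other means.

```python
import json
import platform
import random
import sys
import time
from importlib import metadata


def package_versions(names):
    """Look up installed versions; absent packages are recorded as None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(seed, hyperparameters, packages, train_fn):
    """Seed the RNG, time train_fn, and collect what a reader needs to rerun it."""
    random.seed(seed)  # seeds Python's random module only
    start = time.perf_counter()
    train_fn()
    elapsed = time.perf_counter() - start
    return {
        "seed": seed,
        "hyperparameters": hyperparameters,
        "python_version": sys.version.split()[0],
        "package_versions": package_versions(packages),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "wall_clock_seconds": elapsed,
    }


manifest = build_manifest(
    seed=0,
    hyperparameters={"learning_rate": 0.1, "max_depth": 3},  # illustrative values
    packages=["numpy", "scikit-learn"],
    train_fn=lambda: sum(random.random() for _ in range(1000)),  # stand-in for training
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Writing `manifest_json` next to the saved model gives reviewers a single file to consult when a rerun does not match the reported numbers.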
Reporting uncertainty
Modern ML reporting includes uncertainty quantification:
- CIs on metrics: bootstrap CIs on AUC, calibration error, etc.
- Conformal prediction sets: per-prediction uncertainty.
- Calibration diagnostics: reliability diagrams.
- Per-group performance: not just aggregate.
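The calibration and per-group items can be computed with a few lines of code. The sketch below assumes a binary classifier that outputs a probability for the positive class. The expected calibration error (ECE) variant shown uses equal-width bins over that probability, which is one common choice among several. The function names are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between average predicted probability and observed
    positive rate within equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_prob - positive_rate)
    return ece


def accuracy_by_group(preds, labels, groups):
    """Accuracy computed separately for each group label."""
    correct, counts = {}, {}
    for pred, label, group in zip(preds, labels, groups):
        counts[group] = counts.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + int(pred == label)
    return {group: correct[group] / counts[group] for group in counts}
```

For example, `accuracy_by_group([0, 0, 1, 0], [0, 0, 1, 1], ["a", "a", "b", "b"])` gives an accuracy of 1.0 for group "a" and 0.5 for group "b", a gap that the aggregate accuracy of 0.75 hides.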
Without uncertainty estimates, reported metrics can mislead. A model with 92% ± 1% accuracy supports much stronger conclusions than one with 92% ± 5%, where the true accuracy could plausibly be as low as 87%.
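A percentile bootstrap is one simple way to attach an interval to an accuracy figure. The sketch below assumes the test-set results are available as a list of 0/1 correctness flags; the function name, the number of resamples, and the two synthetic test sets are illustrative choices, not recommendations. It shows how the same 92% point estimate comes with a much wider interval on 50 examples than on 1000.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from a list of 0/1 correctness flags.

    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        resample = rng.choices(correct, k=n)  # sample n items with replacement
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Synthetic flags: both test sets have 92% accuracy, but different sizes.
small_test = [1] * 46 + [0] * 4      # 50 examples
large_test = [1] * 920 + [0] * 80    # 1000 examples

small_acc, small_lo, small_hi = bootstrap_accuracy_ci(small_test)
large_acc, large_lo, large_hi = bootstrap_accuracy_ci(large_test)
```

The same function can be applied to each subgroup's flags separately, which matters because small subgroups often have the widest intervals.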
Try it
- Tick the items you would include in YOUR next ML report. The completeness percentage updates in real time.
- For each unticked item, ask: WHY did I not include this? Is it because it is not relevant, or because I overlooked it?
- Compare to a recent ML paper or product: did it cover all 20 items? Most do not — modern best practice is still being established.
- Note the SIX CATEGORIES (Data, Model, Evaluation, Fairness, Deployment, Ethics). A balanced report covers all six.
A researcher publishes an ML model achieving 90% accuracy on a clinical dataset. What MUST the paper include for a reviewer to evaluate whether the result is trustworthy and the model deployable?
What you now know
Modern ML reports follow standard templates (Model Cards, Datasheets) covering data, model, evaluation, fairness, deployment, and ethics. Reproducibility requires random seeds + hyperparameters + software versions. Uncertainty (CIs, calibration, per-group performance) is essential. §9.8 closes Part 9 with WHEN NOT TO USE ML — the meta-question of when other tools are more appropriate.
References
- Mitchell, M., et al. (2019). "Model cards for model reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*, now FAccT).
- Gebru, T., et al. (2018). "Datasheets for datasets." arXiv:1803.09010.
- Pineau, J., et al. (2021). "Improving reproducibility in machine learning research." JMLR 22(164), 1–20.
- Heil, B.J., et al. (2021). "Reproducibility standards for machine learning in the life sciences." Nature Methods 18, 1132–1135.
- Raji, I.D., et al. (2020). "Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*, now FAccT).