Reporting an ML result so the reader can trust it

Part 9, Machine learning for researchers

Learning objectives

List the SIX MANDATORY components of an ML report (data, model, evaluation, fairness, deployment, ethics)
Recognise MODEL CARDS (Mitchell et al. 2019) and DATASHEETS (Gebru et al. 2018) as standard formats
Document data sources, preprocessing, splits, and missing-data treatment to ensure reproducibility
Report per-group performance + uncertainty quantification (CIs, calibration)
Document limitations, scope of validity, and intended uses

An ML model is only useful if reviewers can EVALUATE it. Modern responsible-AI practice has converged on standardised reporting templates (MODEL CARDS, DATASHEETS) that document everything needed for an informed assessment. §9.7 develops the modern reporting checklist and gives the practical workflow for trustworthy ML deployment.

The six mandatory sections

Data: sources, time range, demographic composition, train/val/test split sizes, class balance, missing-data treatment, pre-processing pipeline (fitted INSIDE each CV fold, on the training portion only, so that no information leaks from held-out data).
Model: model class, hyperparameters, hyperparameter selection procedure, random seeds, reproducibility instructions.
Evaluation: discrimination metrics (AUC, accuracy, F1, etc.), calibration (reliability diagram, Brier, ECE), confidence intervals or bootstrap SEs, comparison to baselines.
Fairness: per-group accuracy + calibration; chosen fairness criterion + justification; mitigation applied.
Deployment: out-of-distribution failure modes, monitoring plan, intended uses, scope of validity.
Ethics: stakeholder consultation, risks of automation, mitigation strategies.
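
The "inside CV folds" requirement in the Data section is easy to state and easy to get wrong, so here is a minimal sketch of what it means for one preprocessing step (standardisation). It uses only the Python standard library. The function names, the toy feature, and the choice of 4 folds are assumptions made for illustration, not part of any particular library or of the checklist itself.

```python
import random
import statistics

def kfold_indices(n, k, seed):
    """Shuffle 0..n-1 with a fixed seed and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def standardise_within_folds(x, k=5, seed=0):
    """Yield (train_z, test_z) per fold, scaling with TRAIN-fold statistics only."""
    folds = kfold_indices(len(x), k, seed)
    for test_idx in folds:
        held_out = set(test_idx)
        train = [x[i] for i in range(len(x)) if i not in held_out]
        test = [x[i] for i in test_idx]
        mu = statistics.fmean(train)      # fitted on the training fold only
        sigma = statistics.pstdev(train)  # the test fold never influences these
        yield ([(v - mu) / sigma for v in train],
               [(v - mu) / sigma for v in test])

# Hypothetical single feature with one extreme value.
feature = [float(v) for v in range(1, 20)] + [500.0]
for fold_no, (train_z, test_z) in enumerate(
        standardise_within_folds(feature, k=4, seed=42)):
    print(fold_no, [round(z, 2) for z in test_z])
```

The leaky alternative would compute the mean and standard deviation once on all 20 values and then split. A report should say which of the two was done, and record the seed used for the split.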

Model Cards (Mitchell et al. 2019)

Google's Model Card template provides a standard structure:

Model details (type, version, owners)
Intended uses + out-of-scope uses
Training data + evaluation data (with demographics)
Per-subgroup performance metrics + uncertainty
Ethical considerations + recommendations + caveats

Model Cards have been adopted by the Hugging Face Model Hub, Microsoft Azure ML, and many corporate ML platforms. The format is publishable as a markdown document accompanying the model.
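
Since a Model Card is typically shipped as a markdown file, one way to keep it in sync with the code is to generate it from a plain dictionary. The sketch below is illustrative only: `render_model_card`, the dictionary layout, and every entry in `example_card` are hypothetical, and the section titles simply mirror the list above rather than any platform's required schema.

```python
def render_model_card(card):
    """Render a dict of section title -> text (or list of bullets) as markdown."""
    lines = [f"# Model card: {card['name']}", ""]
    for title, body in card["sections"].items():
        lines.append(f"## {title}")
        if isinstance(body, list):
            lines.extend(f"- {item}" for item in body)
        else:
            lines.append(body)
        lines.append("")
    return "\n".join(lines)

# All values below are placeholders for illustration.
example_card = {
    "name": "example-classifier (hypothetical)",
    "sections": {
        "Model details": "Gradient-boosted trees, version 0.1, owner: research team.",
        "Intended uses": ["Decision support reviewed by a human"],
        "Out-of-scope uses": ["Fully automated decisions"],
        "Training and evaluation data": "TODO: sources, time range, demographics.",
        "Per-subgroup performance": "TODO: metric and 95% CI per subgroup.",
        "Ethical considerations, recommendations, caveats": "TODO",
    },
}
print(render_model_card(example_card))
```

Leaving explicit TODO entries in the card makes missing sections visible to a reviewer instead of silently absent.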

Datasheets (Gebru et al. 2018)

The complementary template for DATASETS:

Motivation: why was the dataset created?
Composition: what does the data contain? Distribution across subgroups?
Collection: how was it gathered? Sampling biases?
Pre-processing: cleaning, labeling, transformations
Uses: intended + unintended use cases
Distribution: how is the dataset shared? Licensing?

Datasheets + Model Cards together provide end-to-end documentation: from raw data to deployed model.
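
The six datasheet headings above can be tracked with a small record type that reports which questions are still unanswered. This is a sketch under stated assumptions: the `Datasheet` class and its field names are hypothetical shorthand for the headings listed here, and the example answers are invented placeholders. The full datasheet template asks many more detailed questions under each heading.

```python
from dataclasses import dataclass, fields

@dataclass
class Datasheet:
    """One free-text answer per datasheet heading; empty string = unanswered."""
    motivation: str = ""
    composition: str = ""
    collection: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""

    def unanswered(self):
        """Return the headings that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

# Placeholder answers for illustration only.
sheet = Datasheet(
    motivation="Hypothetical: built to study readmission risk.",
    collection="Hypothetical: records from two hospitals, 2015-2020.",
)
print(sheet.unanswered())
# ['composition', 'preprocessing', 'uses', 'distribution']
```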

Reporting reproducibility

The reproducibility crisis applies to ML too: many published results cannot be reproduced because the authors did not specify random seeds, the hyperparameter selection procedure, or the full data preprocessing. Modern requirements:

Random seed for every random component
Full hyperparameter values + selection procedure
Software/library versions
Computational environment (GPU/CPU, memory)
Wall-clock training time

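Papers with Code, conferences that review on OpenReview, and many journals now ask for these details. The Machine Learning Reproducibility Checklist (Pineau et al. 2021), introduced at NeurIPS and since used at other venues such as ICML, is a standard tool.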
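
Most of the items in the list above can be captured automatically at training time. The following is a minimal sketch, assuming a pure-Python training routine: `run_with_manifest`, `library_versions`, and `toy_training` are hypothetical names, and the library list is only an example. It seeds the standard-library generator only, so NumPy, PyTorch, or other frameworks would need their own seeding calls, and GPU model and memory are not recorded here and would have to be added by hand or with a framework-specific query.

```python
import json
import platform
import random
import sys
import time
from importlib import metadata

def library_versions(names):
    """Look up installed versions; record None for packages that are absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

def run_with_manifest(train_fn, seed, hyperparameters, libraries):
    """Seed the stdlib RNG, time train_fn, and return (result, manifest)."""
    random.seed(seed)
    start = time.perf_counter()
    result = train_fn(**hyperparameters)
    manifest = {
        "seed": seed,
        "hyperparameters": hyperparameters,
        "python": sys.version.split()[0],
        "libraries": library_versions(libraries),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "wall_clock_seconds": round(time.perf_counter() - start, 3),
    }
    return result, manifest

def toy_training(n_draws):
    """Stand-in for a real training routine: averages n_draws random numbers."""
    return sum(random.random() for _ in range(n_draws)) / n_draws

result, manifest = run_with_manifest(toy_training, seed=123,
                                     hyperparameters={"n_draws": 1000},
                                     libraries=["numpy", "scikit-learn"])
print(json.dumps(manifest, indent=2))
```

Saving the manifest next to the model file means the report can quote it directly rather than relying on memory.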

Reporting uncertainty

Modern ML reporting includes uncertainty quantification:

CIs on metrics: bootstrap CIs on AUC, calibration error, etc.
Conformal prediction sets: per-prediction uncertainty.
Calibration diagnostics: reliability diagrams.
Per-group performance: not just aggregate.
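
Three of the items above (CIs on metrics, calibration, per-group performance) can be combined in one small routine. The sketch below uses a percentile bootstrap, which is one common choice among several, with accuracy and a 10-bin expected calibration error (ECE) as the metrics. The data are simulated, the group names and sizes are hypothetical, and the 0.5 decision threshold is an assumption. Note that ECE estimates tend to be biased upward in small samples, so its interval should be read with care.

```python
import random

def accuracy(records):
    """Fraction of records whose thresholded prediction (p >= 0.5) matches the label."""
    return sum((p >= 0.5) == bool(y) for y, p in records) / len(records)

def ece(records, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in records:
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    total = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for _, p in b) / len(b)
            frac_pos = sum(y for y, _ in b) / len(b)
            total += len(b) / len(records) * abs(mean_p - frac_pos)
    return total

def bootstrap_ci(records, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample records with replacement."""
    rng = random.Random(seed)
    stats = sorted(metric(rng.choices(records, k=len(records)))
                   for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def simulate(n, rng):
    """Hypothetical calibrated scores: label is 1 with probability p."""
    out = []
    for _ in range(n):
        p = rng.random()
        out.append((int(rng.random() < p), p))
    return out

rng = random.Random(7)
groups = {"group_A": simulate(400, rng), "group_B": simulate(40, rng)}
for name, recs in groups.items():
    for metric in (accuracy, ece):
        lo, hi = bootstrap_ci(recs, metric)
        print(f"{name} n={len(recs)} {metric.__name__}={metric(recs):.3f} "
              f"95% CI [{lo:.3f}, {hi:.3f}]")
```

The smaller group should show a visibly wider interval, which is exactly the information an aggregate metric hides.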

Without uncertainty, a reported metric can mislead. A model with 92% accuracy ± 1% is meaningfully different from one with 92% ± 5%: the second is compatible with a true accuracy as low as 87%.
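
To make the ± 1% versus ± 5% contrast concrete, the arithmetic below asks how large a test set would have to be to produce each interval. It is a back-of-the-envelope sketch that assumes the ± denotes a 95% interval, that test examples are independent, and that the normal-approximation (Wald) formula for a proportion is adequate. The function names are illustrative.

```python
import math

def wald_half_width(acc, n, z=1.96):
    """Half-width of the normal-approximation (Wald) 95% interval for a proportion."""
    return z * math.sqrt(acc * (1 - acc) / n)

def n_for_half_width(acc, half_width, z=1.96):
    """Smallest test-set size whose Wald half-width is at most half_width."""
    return math.ceil(z ** 2 * acc * (1 - acc) / half_width ** 2)

print(n_for_half_width(0.92, 0.01))  # 2828
print(n_for_half_width(0.92, 0.05))  # 114
```

Under these assumptions, 92% ± 1% corresponds to roughly 2,800 test examples and 92% ± 5% to a little over 100, so the interval also tells the reader how much evidence stands behind the headline number.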

Try it

Tick the items you would include in YOUR next ML report. The completeness percentage updates in real time.
For each unticked item, ask: WHY did I not include this? Is it because it is not relevant, or because I overlooked it?
Compare to a recent ML paper or product: did it cover all 20 items? Most do not, because modern best practice is still being established.
Note the SIX CATEGORIES (Data, Model, Evaluation, Fairness, Deployment, Ethics). A balanced report covers all six.

A researcher publishes an ML model achieving 90% accuracy on a clinical dataset. What MUST the paper include for a reviewer to evaluate whether the result is trustworthy and the model deployable?

What you now know

Modern ML reports follow standard templates (Model Cards, Datasheets) covering data, model, evaluation, fairness, deployment, and ethics. Reproducibility requires random seeds + hyperparameters + software versions. Uncertainty (CIs, calibration, per-group performance) is essential. §9.8 closes Part 9 with WHEN NOT TO USE ML, the meta-question of when other tools are more appropriate.

References

Mitchell, M., et al. (2019). "Model cards for model reporting." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19).
Gebru, T., et al. (2018). "Datasheets for datasets." arXiv:1803.09010.
Pineau, J., et al. (2021). "Improving reproducibility in machine learning research." JMLR 22(164), 1-20.
Heil, B.J., et al. (2021). "Reproducibility standards for machine learning in the life sciences." Nature Methods 18, 1132-1135.
Raji, I.D., et al. (2020). "Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing." Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '20).