Capstone: ML deployment with fairness audit

Part 10 — Real-research capstones

Learning objectives

  • Train and audit a binary classifier with a sensitive group attribute
  • Compute statistical-parity, equalized-odds, and per-group calibration metrics
  • Trace the fairness-accuracy frontier across decision thresholds
  • Recognise that, when base rates differ across groups, no imperfect classifier can satisfy all the standard fairness definitions at once; choose deliberately based on application context

The fifth capstone audits a deployed binary classifier — a lending model — for GROUP FAIRNESS. The setup: customers from two groups (A and B) with a continuous credit score; the model predicts P(default). Audit questions: is performance equal across groups? Are decisions equally good for the same risk profile? Does the model perpetuate or amplify historical disparities?

Three (often-incompatible) fairness definitions

  • Statistical parity: P(Ŷ = 1 | group = A) = P(Ŷ = 1 | group = B). Equal positive-prediction rates across groups. Useful when groups are presumed to be "equivalent" pools.
  • Equalized odds (Hardt, Price & Srebro 2016): P(Ŷ = 1 | Y = y, group = A) = P(Ŷ = 1 | Y = y, group = B) for y ∈ {0, 1}. Equal TPR and FPR across groups — the classifier treats both groups equally given their TRUE outcome.
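  • Calibration within groups: P(Y = 1 | S = p, group = A) = P(Y = 1 | S = p, group = B), where S is the predicted probability of default (the score before thresholding). For ANY score p, the actual default rate is the same regardless of group; strict calibration also requires that common rate to equal p. The thresholded counterpart, predictive parity, asks for equal positive predictive value: P(Y = 1 | Ŷ = 1, group = A) = P(Y = 1 | Ŷ = 1, group = B).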
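
The first two definitions reduce to comparing a few per-group rates. The following is a minimal sketch on toy data, not a full audit: the function names (`group_rates`, `fairness_gaps`) and the eight-row dataset are illustrative assumptions, and Ŷ = 1 is taken to mean "predicted default".

```python
from collections import namedtuple

GroupRates = namedtuple("GroupRates", "positive_rate tpr fpr")

def group_rates(y_true, y_pred, groups, group):
    """Positive-prediction rate, TPR and FPR for the rows of one group."""
    rows = [(y, yhat) for y, yhat, g in zip(y_true, y_pred, groups) if g == group]
    preds_for_pos = [yhat for y, yhat in rows if y == 1]
    preds_for_neg = [yhat for y, yhat in rows if y == 0]
    return GroupRates(
        positive_rate=sum(yhat for _, yhat in rows) / len(rows),
        tpr=sum(preds_for_pos) / len(preds_for_pos),
        fpr=sum(preds_for_neg) / len(preds_for_neg),
    )

def fairness_gaps(y_true, y_pred, groups):
    """Absolute A-versus-B differences for statistical parity and equalized odds."""
    a = group_rates(y_true, y_pred, groups, "A")
    b = group_rates(y_true, y_pred, groups, "B")
    return {
        "statistical_parity_gap": abs(a.positive_rate - b.positive_rate),
        "tpr_gap": abs(a.tpr - b.tpr),
        "fpr_gap": abs(a.fpr - b.fpr),
    }

# Toy data (illustrative only): four customers per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["A"] * 4 + ["B"] * 4

gaps = fairness_gaps(y_true, y_pred, groups)
for name, value in gaps.items():
    print(f"{name}: {value:.2f}")
```

Equalized odds holds only when both `tpr_gap` and `fpr_gap` are zero, so an audit would normally report the two separately rather than as one number.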

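Chouldechova (2017) and Kleinberg, Mullainathan & Raghavan (2017) proved an IMPOSSIBILITY result: when base rates of Y differ across groups, an imperfect classifier cannot satisfy both calibration (or predictive parity) and equal error rates across groups. Statistical parity adds a further conflict: with unequal base rates it is compatible with equalized odds only if TPR = FPR, that is, only if the predictions carry no information about Y. So all three cannot hold together. You must choose, and that choice is the central tension in algorithmic fairness.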
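
The conflict can be read off two identities that follow from Bayes' rule. The notation below is introduced here for illustration: p_g is the base rate P(Y = 1 | group = g), and PPV_g is the positive predictive value P(Y = 1 | Ŷ = 1, group = g).

```latex
\begin{align*}
\mathrm{PPV}_g &= \frac{p_g\,\mathrm{TPR}_g}{p_g\,\mathrm{TPR}_g + (1 - p_g)\,\mathrm{FPR}_g}
\quad\Longrightarrow\quad
\mathrm{FPR}_g = \frac{p_g}{1 - p_g}\cdot\frac{1 - \mathrm{PPV}_g}{\mathrm{PPV}_g}\cdot\mathrm{TPR}_g, \\[4pt]
P(\hat{Y} = 1 \mid \text{group} = g) &= p_g\,\mathrm{TPR}_g + (1 - p_g)\,\mathrm{FPR}_g .
\end{align*}
```

If PPV and TPR are equal across groups but p_A ≠ p_B, the first line forces FPR_A ≠ FPR_B unless TPR = 0 or PPV = 1. If instead TPR and FPR are equal across groups, the second line gives a positive-rate difference of (p_A − p_B)(TPR − FPR), which vanishes only when TPR = FPR.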

The fairness-accuracy frontier

Varying the decision threshold τ traces a trade-off curve:

  • Low τ → high TPR (catch more defaults) but high FPR (more false alarms). At extreme low τ, EVERYONE is positive — no inequalities, no information.
  • High τ → low FPR (few false alarms) but high FNR (miss many defaults). Models a "conservative" lender.
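
The two regimes above can be checked with a small threshold sweep. This is a sketch on eight hand-picked scores (an assumption for illustration, not the demo's data); a customer is flagged as a predicted default when the score is at least τ.

```python
def rates_at_threshold(scores, y_true, tau):
    """Flag score >= tau as a predicted default; return (TPR, FPR)."""
    tp = sum(1 for s, y in zip(scores, y_true) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, y_true) if s >= tau and y == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg

# Illustrative scores: four defaulters (y = 1) and four non-defaulters (y = 0).
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]
y_true = [0, 0, 1, 0, 0, 1, 1, 1]

sweep = {tau: rates_at_threshold(scores, y_true, tau)
         for tau in (0.0, 0.3, 0.5, 0.7, 1.01)}
for tau, (tpr, fpr) in sweep.items():
    print(f"tau={tau:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

At τ = 0 both rates are 1 (everyone flagged); above the largest score both are 0. Running the same sweep separately per group gives the per-group curves whose vertical and horizontal distances are the equalized-odds gaps.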

If the score discriminates less well for one group (that group has a lower AUC), the frontier shows a hard trade-off: minimising the equalized-odds gap requires accepting accuracy loss. Group-specific thresholds (a post-processing step) can sometimes recover both, at the cost of explicitly using the sensitive attribute in deployment.
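
A rough way to explore this is a grid search over one threshold per group. The sketch below assumes a hypothetical data-generating process (logistic-squashed Gaussian scores whose class separation differs by group); it is not the demo's generator, and all function names are illustrative.

```python
import math
import random
from itertools import product

def simulate_group(n, base_rate, separation, rng):
    """Hypothetical (score, outcome) pairs; larger `separation` means better ranking."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < base_rate else 0
        z = rng.gauss(separation * (y - 0.5), 1.0)
        data.append((1 / (1 + math.exp(-z)), y))
    return data

def tpr_fpr(data, tau):
    flagged_pos = [s >= tau for s, y in data if y == 1]
    flagged_neg = [s >= tau for s, y in data if y == 0]
    return sum(flagged_pos) / len(flagged_pos), sum(flagged_neg) / len(flagged_neg)

def eo_gap(data_a, data_b, tau_a, tau_b):
    """Equalized-odds gap: the larger of the TPR and FPR differences."""
    tpr_a, fpr_a = tpr_fpr(data_a, tau_a)
    tpr_b, fpr_b = tpr_fpr(data_b, tau_b)
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))

def accuracy(data_a, data_b, tau_a, tau_b):
    hits = sum((s >= tau_a) == (y == 1) for s, y in data_a)
    hits += sum((s >= tau_b) == (y == 1) for s, y in data_b)
    return hits / (len(data_a) + len(data_b))

rng = random.Random(0)
group_a = simulate_group(500, 0.3, 2.5, rng)  # informative score
group_b = simulate_group(500, 0.5, 0.5, rng)  # weakly informative score

grid = [i / 20 for i in range(1, 20)]
# Most accurate single threshold shared by both groups.
acc_tau = max(grid, key=lambda t: accuracy(group_a, group_b, t, t))
# Pair of group-specific thresholds with the smallest equalized-odds gap.
fair_pair = min(product(grid, grid), key=lambda p: eo_gap(group_a, group_b, *p))

for label, (ta, tb) in (("shared", (acc_tau, acc_tau)), ("split", fair_pair)):
    print(f"{label}: tau_A={ta:.2f} tau_B={tb:.2f} "
          f"gap={eo_gap(group_a, group_b, ta, tb):.3f} "
          f"acc={accuracy(group_a, group_b, ta, tb):.3f}")
```

Minimising the gap alone can drift toward thresholds where almost everyone or almost no one is flagged, so in practice the search would be constrained, for example by keeping only pairs above an accuracy floor.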

Calibration plots: the most-overlooked fairness diagnostic

For each group, bin predicted probabilities and plot the mean prediction in each bin against the observed default rate in that bin. Perfect calibration puts every point on y = x. Common failure: predicted probabilities for one group are systematically too high (overpredicted defaults) or too low. This is a CALIBRATION FAILURE that statistical-parity and equalized-odds gaps, which are computed on thresholded decisions, do not reveal, yet it determines whether downstream decisions that rely on the probabilities themselves are right.
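
The binning step can be sketched in a few lines. The example assumes equal-width bins and a hypothetical distortion (square-root inflation of group B's scores) chosen only to make the overprediction visible; the helper names are illustrative.

```python
import random

def calibration_table(scores, y_true, n_bins=5):
    """(mean predicted, observed rate, count) for each non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, y_true):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(s for s, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            table.append((mean_pred, observed, len(members)))
    return table

def mean_signed_gap(table):
    """Count-weighted mean of (predicted - observed); positive means overprediction."""
    total = sum(n for _, _, n in table)
    return sum((pred - obs) * n for pred, obs, n in table) / total

def simulate(n, distort, rng):
    """Outcomes follow the true probability p; the reported score is distort(p)."""
    scores, outcomes = [], []
    for _ in range(n):
        p = rng.random()
        outcomes.append(1 if rng.random() < p else 0)
        scores.append(distort(p))
    return scores, outcomes

rng = random.Random(1)
scores_a, y_a = simulate(5000, lambda p: p, rng)         # group A: calibrated
scores_b, y_b = simulate(5000, lambda p: p ** 0.5, rng)  # group B: inflated scores

gap_a = mean_signed_gap(calibration_table(scores_a, y_a))
gap_b = mean_signed_gap(calibration_table(scores_b, y_b))
print(f"signed gap A: {gap_a:+.3f}   signed gap B: {gap_b:+.3f}")
```

Plotting the first two columns of each group's table against each other gives the calibration curve; the signed gap is only a one-number summary and can hide errors that cancel across bins.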

[Interactive figure: ML fairness audit demo]

Try it

  • Defaults (rate_A = 0.3, rate_B = 0.5, equal score-discrim, threshold = 0.5). Statistical-parity gap and equalized-odds gaps are present but moderate. Calibration plot: both group curves track the diagonal.
  • Set score-discrim A = 2.5 (high) and B = 0.5 (low). Group A's AUC is high (good ranking); B's AUC is poor. The fairness-accuracy frontier shows the trade-off can't be eliminated by re-thresholding: the score carries more information about group A than about group B, and that advantage is structural.
  • Slide threshold up to 0.7. The statistical-parity gap tends to shrink as overall positives fall; the equalized-odds gap may widen or narrow depending on group risk profiles. Inspect the readouts.
  • Equal base rates (set both to 0.3). Equalized odds is achievable simultaneously with calibration. With different base rates and an imperfect classifier, joint satisfaction is provably impossible: this is the Chouldechova / Kleinberg impossibility.
  • Crank n to 2000. Metric estimates become less noisy, but the FUNDAMENTAL trade-off remains. Algorithmic fairness is about CHOOSING which inequality to accept, not eliminating all of them.
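
The AUC contrast in the second experiment can be reproduced offline. The sketch below assumes that "score-discrim" behaves like the distance between two unit-variance Gaussian score distributions; that mapping is a guess made for illustration, not a description of the demo's internals. Under this assumption the AUC has the closed form Φ(d/√2).

```python
import math
import random

def auc(pos_scores, neg_scores):
    """Probability a random defaulter outscores a random non-defaulter (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def simulate_scores(n, base_rate, separation, rng):
    """Hypothetical raw scores: N(separation, 1) for defaulters, N(0, 1) otherwise."""
    pos, neg = [], []
    for _ in range(n):
        if rng.random() < base_rate:
            pos.append(rng.gauss(separation, 1.0))
        else:
            neg.append(rng.gauss(0.0, 1.0))
    return pos, neg

def binormal_auc(separation):
    """Closed-form AUC for two unit-variance Gaussians whose means differ by `separation`."""
    return 0.5 * (1 + math.erf(separation / 2))

rng = random.Random(42)
auc_a = auc(*simulate_scores(600, 0.3, 2.5, rng))
auc_b = auc(*simulate_scores(600, 0.5, 0.5, rng))
print(f"group A: empirical {auc_a:.3f}, theoretical {binormal_auc(2.5):.3f}")
print(f"group B: empirical {auc_b:.3f}, theoretical {binormal_auc(0.5):.3f}")
```

Because thresholding cannot change how well a score ranks customers within a group, no choice of τ raises group B's AUC; only better features or more informative data for that group would.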

A regulator demands equalized odds. The current model satisfies calibration but violates equalized odds. The data scientist proposes group-specific thresholds (lower τ for group B). What are the LEGAL / ETHICAL implications of this fix in a US-disparate-treatment context, and what alternatives might be more defensible?

What you now know

You can audit a deployed ML classifier across the standard fairness metrics. You understand that fairness definitions can conflict, so choosing which to optimise is a values judgment, not a technical one. You know that calibration is often the most diagnostically important property for downstream decisions, and that mitigations (group-specific thresholds, reject-option classification, calibration-aware training) all involve trade-offs that should be made explicit in deployment documentation.

References

  • Hardt, M., Price, E., Srebro, N. (2016). "Equality of opportunity in supervised learning." NIPS.
  • Chouldechova, A. (2017). "Fair prediction with disparate impact." Big Data 5(2), 153–163. (Impossibility theorem.)
  • Kleinberg, J., Mullainathan, S., Raghavan, M. (2017). "Inherent trade-offs in the fair determination of risk scores." ITCS.
  • Barocas, S., Hardt, M., Narayanan, A. (2019). Fairness and Machine Learning. Online textbook fairmlbook.org.
  • Mitchell, S. et al. (2021). "Algorithmic fairness: Choices, assumptions, and definitions." Annual Review of Statistics and Its Application 8, 141–163.
