Capstone: ML deployment with fairness audit

Part 10, Real-research capstones

Learning objectives

Train and audit a binary classifier with a sensitive group attribute
Compute statistical-parity, equalized-odds, and per-group calibration metrics
Trace the fairness-accuracy frontier across decision thresholds
Recognise that, when base rates differ across groups, no imperfect classifier can satisfy ALL of these fairness definitions at once; choose deliberately based on application context

The fifth capstone audits a deployed binary classifier, a lending model, for GROUP FAIRNESS. The setup: customers from two groups (A and B) with a continuous credit score; the model predicts P(default). Audit questions: is performance equal across groups? Are decisions equally good for the same risk profile? Does the model perpetuate or amplify historical disparities?

Three (often-incompatible) fairness definitions

Statistical parity: P(Ŷ = 1 | group = A) = P(Ŷ = 1 | group = B). Equal positive-prediction rates across groups. Useful when groups are presumed to be "equivalent" pools.
Equalized odds (Hardt, Price & Srebro 2016): P(Ŷ = 1 | Y = y, group = A) = P(Ŷ = 1 | Y = y, group = B) for y ∈ {0, 1}. Equal TPR and FPR across groups: the classifier treats both groups equally given their TRUE outcome.
Calibration / predictive parity: P(Y = 1 | S = p, group = A) = P(Y = 1 | S = p, group = B), where S is the predicted probability. For ANY predicted probability p, the actual default rate is the same regardless of group (and equals p when the model is calibrated within each group). Predictive parity is the thresholded analogue: equal P(Y = 1 | Ŷ = 1) across groups.
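The two decision-based definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the capstone's own code: the function names `group_rates` and `fairness_gaps` and the toy arrays are assumptions made for the example. Calibration needs predicted probabilities rather than hard decisions, so it is not covered here.

```python
import numpy as np

def group_rates(y_true, y_pred, group, g):
    """Positive-prediction rate, TPR and FPR for the members of group g."""
    mask = group == g
    y, y_hat = y_true[mask], y_pred[mask]
    return {
        "pos_rate": y_hat.mean(),       # P(Y_hat = 1 | group = g)
        "tpr": y_hat[y == 1].mean(),    # P(Y_hat = 1 | Y = 1, group = g)
        "fpr": y_hat[y == 0].mean(),    # P(Y_hat = 1 | Y = 0, group = g)
    }

def fairness_gaps(y_true, y_pred, group, a="A", b="B"):
    """Absolute between-group gaps; 0 means the criterion holds exactly."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    ra = group_rates(y_true, y_pred, group, a)
    rb = group_rates(y_true, y_pred, group, b)
    return {
        "statistical_parity_gap": abs(ra["pos_rate"] - rb["pos_rate"]),
        "tpr_gap": abs(ra["tpr"] - rb["tpr"]),  # equalized odds, part 1
        "fpr_gap": abs(ra["fpr"] - rb["fpr"]),  # equalized odds, part 2
    }

# Toy data (hypothetical): four customers per group, 1 = default.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
group = ["A"] * 4 + ["B"] * 4
gaps = fairness_gaps(y_true, y_pred, group)
print(gaps)
```

Equalized odds requires both `tpr_gap` and `fpr_gap` to be zero, which is why the sketch reports them separately instead of collapsing them into one number.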

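Chouldechova (2017) and Kleinberg, Mullainathan & Raghavan (2017) proved IMPOSSIBILITY results: when base rates of Y differ across groups and the classifier is not perfect, calibration (or predictive parity) and equalized odds cannot both hold. Statistical parity generally conflicts with each of the other two under the same condition, so you cannot satisfy all three simultaneously. You must choose which criterion to relax. The tension is structural: more data or a better optimiser does not remove it.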
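The conflict can be checked numerically with the confusion-matrix identity that underlies Chouldechova's result: FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR), where p is the group's base rate. The sketch below fixes PPV and FNR to be equal across groups and shows that the FPR is then forced to differ. The helper name `implied_fpr` and the values PPV = 0.7 and FNR = 0.2 are illustrative assumptions; the base rates 0.3 and 0.5 are the capstone defaults.

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR forced by the identity FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR)."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Same PPV and FNR in both groups, different base rates.
fpr_a = implied_fpr(base_rate=0.3, ppv=0.7, fnr=0.2)
fpr_b = implied_fpr(base_rate=0.5, ppv=0.7, fnr=0.2)
print(f"group A FPR = {fpr_a:.3f}, group B FPR = {fpr_b:.3f}")
```

With these inputs group B's false-positive rate comes out more than twice group A's, so equal PPV and equal FNR already rule out equal FPR.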

The fairness-accuracy frontier

Varying the decision threshold τ traces a trade-off curve:

Low τ → high TPR (catch more defaults) but high FPR (more false alarms). At extreme low τ, EVERYONE is flagged positive: no between-group gaps, but no information either.
High τ → low FPR (few false alarms) but high FNR (miss many defaults). Models a lender that is reluctant to flag applicants as risky.
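A threshold sweep along these lines might look as follows. Everything about the data is an assumption made for illustration: the `simulate` generator, its Gaussian score model, the discrimination value 1.5, and treating n as a per-group count are hypothetical stand-ins, not the capstone's actual simulator.

```python
import numpy as np

def simulate(n, rate_a, rate_b, discrim_a, discrim_b, seed=0):
    """Hypothetical generator: n customers per group, P(default) from Bayes' rule."""
    rng = np.random.default_rng(seed)
    group = np.repeat(["A", "B"], n)
    rate = np.where(group == "A", rate_a, rate_b)
    discrim = np.where(group == "A", discrim_a, discrim_b)
    y = (rng.random(2 * n) < rate).astype(int)  # 1 = default
    # Risk score: N(+d/2, 1) for defaulters, N(-d/2, 1) for everyone else.
    x = rng.normal(loc=(y - 0.5) * discrim, scale=1.0)
    # Posterior log-odds under this model = prior log-odds + d * x.
    p_hat = 1 / (1 + np.exp(-(np.log(rate / (1 - rate)) + discrim * x)))
    return y, p_hat, group

def sweep(y, p_hat, group, thresholds):
    """Accuracy and equalized-odds gaps at each decision threshold."""
    rows = []
    for tau in thresholds:
        y_hat = (p_hat >= tau).astype(int)
        tpr = [y_hat[(group == g) & (y == 1)].mean() for g in ("A", "B")]
        fpr = [y_hat[(group == g) & (y == 0)].mean() for g in ("A", "B")]
        rows.append({
            "tau": float(tau),
            "accuracy": float((y_hat == y).mean()),
            "tpr_gap": float(abs(tpr[0] - tpr[1])),
            "fpr_gap": float(abs(fpr[0] - fpr[1])),
        })
    return rows

y, p_hat, group = simulate(n=2000, rate_a=0.3, rate_b=0.5,
                           discrim_a=1.5, discrim_b=1.5)
frontier = sweep(y, p_hat, group, np.linspace(0.0, 1.0, 11))
for row in frontier:
    print(row)
```

At τ = 0 every customer is flagged, so both gaps are exactly zero and accuracy collapses to the overall default rate, which is the degenerate end of the curve described above.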

If the model discriminates less well for one group (that group has lower AUC), the frontier shows a hard trade-off: with a single shared threshold, minimising the equalized-odds gap requires accepting accuracy loss. Group-specific thresholds (a post-processing fix) can sometimes recover both, at the cost of explicitly using the sensitive attribute in deployment.
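Group-specific thresholds can be sketched as picking, per group, the score quantile that yields a target TPR. Note the limits of this illustration: it equalizes TPR only (the "equal opportunity" relaxation in Hardt et al.), not FPR, and the helper `threshold_for_tpr`, the synthetic scores, and the 0.8 target are all assumptions chosen for the example.

```python
import numpy as np

def threshold_for_tpr(p_hat, y_true, target_tpr):
    """Threshold at which roughly target_tpr of true positives score at or above it."""
    return np.quantile(p_hat[y_true == 1], 1.0 - target_tpr)

rng = np.random.default_rng(0)
n = 2000  # customers per group (assumption)
group = np.repeat(["A", "B"], n)
y = (rng.random(2 * n) < np.where(group == "A", 0.3, 0.5)).astype(int)

# Hypothetical scores: group A separates defaulters far better than group B.
sep = np.where(group == "A", 2.5, 0.5)
p_hat = 1 / (1 + np.exp(-(sep * (y - 0.5) + rng.normal(size=2 * n))))

target = 0.8
tau = {g: threshold_for_tpr(p_hat[group == g], y[group == g], target)
       for g in ("A", "B")}

# Apply each customer's own group threshold.
y_hat = p_hat >= np.where(group == "A", tau["A"], tau["B"])

tpr = {g: y_hat[(group == g) & (y == 1)].mean() for g in ("A", "B")}
fpr = {g: y_hat[(group == g) & (y == 0)].mean() for g in ("A", "B")}
print("thresholds:", tau)
print("TPR:", tpr)
print("FPR:", fpr)
```

The TPRs match by construction, but the group with weaker score separation pays for it with a much higher FPR. Closing both gaps at once generally needs randomised decisions or a less accurate operating point, which is the trade-off the frontier makes visible.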

Calibration plots: the most-overlooked fairness diagnostic

For each group, bin the predicted probabilities and plot the mean predicted probability in each bin against the observed default rate. Perfect calibration = points on y = x. Common failure: predicted probabilities for one group are systematically too high (overpredicted defaults) or too low. This is a CALIBRATION FAILURE that threshold-based statistics such as the statistical-parity and equalized-odds gaps can miss, yet it is exactly what determines whether downstream decisions are right.
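The binning step can be sketched as below. The function name `calibration_table`, the equal-width bins, and the toy data (group B deliberately overpredicted) are assumptions for illustration; plotting the first column against the second for each group gives the calibration curve.

```python
import numpy as np

def calibration_table(y_true, p_hat, n_bins=10):
    """Rows of (mean predicted probability, observed rate, count) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    # Equal-width bins on [0, 1]; p_hat == 1.0 is placed in the last bin.
    idx = np.minimum((p_hat * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            rows.append((float(p_hat[in_bin].mean()),
                         float(y_true[in_bin].mean()),
                         int(in_bin.sum())))
    return rows

# Toy data (hypothetical): identical predictions, different outcomes per group.
p_hat = np.array([0.25] * 4 + [0.75] * 4)
y_a = np.array([1, 0, 0, 0, 1, 1, 1, 0])  # observed rates match predictions
y_b = np.array([0, 0, 0, 0, 1, 1, 0, 0])  # observed rates fall below predictions

table_a = calibration_table(y_a, p_hat, n_bins=4)
table_b = calibration_table(y_b, p_hat, n_bins=4)
print("group A:", table_a)
print("group B:", table_b)
```

In this toy case group A sits on the diagonal while every group B point lies below it, meaning defaults are overpredicted for B at every score level.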

Try it

Defaults (rate_A = 0.3, rate_B = 0.5, equal score-discrim, threshold = 0.5). Statistical-parity gap and equalized-odds gaps are present but moderate. Calibration plot: both group curves track the diagonal.
Set score-discrim A = 2.5 (high) and B = 0.5 (low). Group A's AUC is high (good ranking); B's AUC is poor. The fairness-accuracy frontier shows that re-thresholding cannot eliminate the trade-off: A's FUNDAMENTAL information advantage is structural.
Slide threshold up to 0.7. The statistical-parity gap usually shrinks in absolute terms because fewer customers in either group are flagged positive; the equalized-odds gap may widen or narrow depending on group risk profiles. Inspect the readouts.
Set equal base rates (both 0.3). With equal score-discrim as well, equalized odds and calibration can hold together. With different base rates, joint satisfaction is provably impossible for any imperfect classifier: this is the Chouldechova / Kleinberg impossibility.
Increase n to 2000. Sampling noise in the metric estimates shrinks, but the FUNDAMENTAL trade-off remains. Algorithmic fairness is about CHOOSING which inequality to accept, not eliminating all of them.
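The per-group AUC comparison in the second step can be reproduced with a direct pairwise definition of AUC. This is a sketch under assumptions: the helper `auc` and the toy scores are hypothetical, and the pairwise form is quadratic in memory, which is fine at a few thousand rows but not at scale.

```python
import numpy as np

def auc(y_true, score):
    """Probability that a random positive outranks a random negative (ties count half)."""
    y_true = np.asarray(y_true)
    score = np.asarray(score, dtype=float)
    pos = score[y_true == 1]
    neg = score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive-negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy data (hypothetical): group A is ranked perfectly, group B is not.
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
score = np.array([0.1, 0.2, 0.8, 0.9, 0.4, 0.6, 0.5, 0.7])
group = np.array(["A"] * 4 + ["B"] * 4)

auc_by_group = {g: auc(y[group == g], score[group == g]) for g in ("A", "B")}
print(auc_by_group)  # {'A': 1.0, 'B': 0.75}
```

A gap in per-group AUC signals that the scores carry less ranking information for one group, which no choice of threshold can add back.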

A regulator demands equalized odds. The current model satisfies calibration but violates equalized odds. The data scientist proposes group-specific thresholds (lower τ for group B). What are the LEGAL / ETHICAL implications of this fix in a US-disparate-treatment context, and what alternatives might be more defensible?

What you now know

You can audit a deployed ML classifier across the standard fairness metrics. You understand that fairness definitions can conflict, and that choosing which to optimise is a values judgment, not a technical one. You know that calibration is often the most diagnostically important property for downstream decisions, and that mitigation options (post-hoc fixes such as group-specific thresholds and reject-option classification, or training-time approaches such as calibration-aware training) all involve trade-offs that should be made explicit in deployment documentation.

References

Hardt, M., Price, E., Srebro, N. (2016). "Equality of opportunity in supervised learning." NIPS.
Chouldechova, A. (2017). "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments." Big Data 5(2), 153-163. (Impossibility theorem.)
Kleinberg, J., Mullainathan, S., Raghavan, M. (2017). "Inherent trade-offs in the fair determination of risk scores." ITCS.
Barocas, S., Hardt, M., Narayanan, A. (2019). Fairness and Machine Learning. Online textbook fairmlbook.org.
Mitchell, S. et al. (2021). "Algorithmic fairness: Choices, assumptions, and definitions." Annual Review of Statistics and Its Application 8, 141-163.