Finding a Fairer Model That Still Performs · The Data Substrate for AI in Finance

Most fair lending programs stop at the disparity test. They measure whether a model approves protected and control groups at different rates, attach a business-necessity rationale, and file it. That leaves the hardest question unanswered: could a different model have hit the same business target with less disparity?

Under the Equal Credit Opportunity Act and the Fair Housing Act, that question is not optional. Disparate-impact analysis is a three-step framework. The plaintiff shows a policy disproportionately harms a protected group. The lender shows the policy serves a legitimate business need. Then the plaintiff gets to show the same need could be met by a less discriminatory alternative, or LDA. A model can survive the first two steps and still be unlawful at the third.

The prior module covered how to test for disparate impact. This one covers what you owe once you find it: a real search for a fairer model, run and documented before a regulator or plaintiff runs it for you.

Why an alternative usually exists

The premise that makes LDA search worth doing is model multiplicity, also called the Rashomon effect or predictive multiplicity. For most prediction problems there is not one best model. There is a set of models with statistically indistinguishable accuracy that disagree on individual decisions and, more to the point, differ in how they distribute errors across groups.

If a set of equally accurate models exists and they vary in disparity, then picking one without checking fairness is an arbitrary choice. The CFPB has been explicit that fair lending compliance requires lenders to develop a process for considering a range of less discriminatory models, and that searching for and adopting LDAs is a core part of model risk management, not a courtesy.

The practical implication is uncomfortable. "Our model has the highest possible accuracy, so the disparity is unavoidable" is the standard business-necessity argument, and model multiplicity is precisely the evidence that it is usually false.

Two axes of search

A complete search moves along two axes.

Vertical search, within a model

Vertical search holds the model family fixed and varies its construction. You retrain the same architecture across different hyperparameters, regularization strengths, feature subsets, decision thresholds, and random seeds, then plot each result on accuracy against disparity. The Rashomon set is the band of candidates whose accuracy sits within a pre-declared tolerance of your champion. Inside that band you look for the lowest-disparity point.
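A minimal sketch of that sweep, assuming you supply your own `evaluate` function that trains one configuration and returns its accuracy metric and adverse-impact ratio. The grid values, the function names, and the result layout are illustrative assumptions, not a prescribed pipeline.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical search space for one model family; keys and values are illustrative.
GRID = {
    "max_depth": [3, 4, 6],
    "learning_rate": [0.05, 0.1],
    "seed": [0, 1, 2],
}

Config = Dict[str, float]
Scores = Tuple[float, float]  # (accuracy metric, adverse-impact ratio)


def vertical_sweep(grid: Dict[str, list], evaluate: Callable[[Config], Scores]) -> List[dict]:
    """Score every configuration in the grid and keep every result."""
    keys = sorted(grid)
    results = []
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        accuracy, impact_ratio = evaluate(config)  # caller-supplied train-and-score step
        results.append({"config": config, "accuracy": accuracy, "impact_ratio": impact_ratio})
    return results


def within_band(results: List[dict], champion_accuracy: float, tolerance: float) -> List[dict]:
    """Candidates whose accuracy is within the pre-declared tolerance of the champion."""
    floor = champion_accuracy - tolerance
    return [r for r in results if r["accuracy"] >= floor]
```

Keeping the full `results` list, not only the band, matters later: the rejected configurations are part of the evidence that the search covered a real range.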

This is the cheapest search to run because you already own the pipeline. It is also the easiest to defend, because nothing about the inputs or the model type changes.

Horizontal search, across model families

Horizontal search swaps the architecture itself: logistic regression, gradient-boosted trees, a constrained neural network, a fairness-aware learner. Recent work argues that structural differences between families often produce larger fairness gains than tuning within one family. That makes horizontal search a sensible early step when time or compute is limited, not something to bolt on at the end.

Run vertical search and you prove you optimized your chosen model. Run horizontal search and you prove you did not anchor on a model type that happened to be unfair. Regulators increasingly expect both.
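One way to compare candidates from both axes on the same footing is to pool them and keep only the non-dominated ones, a Pareto frontier over accuracy and disparity. This is a sketch of that idea, not a method the regulators prescribe. The `Candidate` fields are hypothetical, and it assumes the impact ratio is below 1.0, so a higher ratio means less disparity.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    family: str          # e.g. "gbt" or "logistic"; labels are illustrative
    accuracy: float      # higher is better
    impact_ratio: float  # assumed below 1.0, so higher means less disparity


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.impact_ratio >= b.impact_ratio
    better = a.accuracy > b.accuracy or a.impact_ratio > b.impact_ratio
    return no_worse and better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Candidates that no other candidate beats on both accuracy and disparity."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```

A candidate that falls off the frontier is worse than some alternative on both axes. The frontier is where the trade-off between the model families becomes visible.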

A worked example

Suppose your champion is a gradient-boosted approval model with an AUC of 0.82. Your adverse-impact ratio (protected-group approval rate divided by control-group rate) is 0.74, below the rough 0.80 screening line many fair lending teams use as a flag.
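The ratio itself is simple arithmetic. The sketch below computes it from binary approval decisions. The function names and the group counts are illustrative assumptions, chosen only so the result matches the 0.74 in this example.

```python
from typing import Sequence


def approval_rate(decisions: Sequence[int]) -> float:
    """Share of applicants approved; each decision is 1 (approve) or 0 (deny)."""
    if not decisions:
        raise ValueError("group has no applicants")
    return sum(decisions) / len(decisions)


def adverse_impact_ratio(protected: Sequence[int], control: Sequence[int]) -> float:
    """Protected-group approval rate divided by control-group approval rate."""
    control_rate = approval_rate(control)
    if control_rate == 0:
        raise ValueError("control group has no approvals; ratio is undefined")
    return approval_rate(protected) / control_rate


# Illustrative counts only: 37 of 100 protected-group approvals, 50 of 100 control-group.
protected = [1] * 37 + [0] * 63
control = [1] * 50 + [0] * 50
air = adverse_impact_ratio(protected, control)
print(round(air, 2), air < 0.80)  # prints: 0.74 True
```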

You declare an accuracy tolerance up front: any candidate within 0.005 AUC of the champion, so 0.815 or higher, counts as performance-equivalent. Then you search.

The vertical sweep, varying tree depth, learning rate, monotonic constraints, and the feature set, surfaces a candidate at AUC 0.818 with an impact ratio of 0.81. The horizontal sweep adds a monotonic logistic model at AUC 0.816 with a ratio of 0.85. Both clear your tolerance. The logistic candidate cuts the disparity most while costing 0.004 of AUC, a difference you already agreed was immaterial.

The decision is now documented and hard to argue against. You did not trade away meaningful performance. You found a fairer model inside the band you defined as equivalent, and you have the search log to show the champion was a choice, not a necessity.
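The selection in this example can be reproduced in a few lines. The sketch uses the figures from the text. The dictionary layout and candidate labels are assumptions, and "lowest disparity" is taken to mean the highest impact ratio, which holds only while ratios are below 1.0.

```python
CHAMPION_AUC = 0.82
TOLERANCE = 0.005  # declared before the sweep, not tuned afterwards

# Figures from the worked example; names are illustrative labels.
candidates = [
    {"name": "champion GBT", "auc": 0.82, "impact_ratio": 0.74},
    {"name": "vertical GBT variant", "auc": 0.818, "impact_ratio": 0.81},
    {"name": "monotonic logistic", "auc": 0.816, "impact_ratio": 0.85},
]

# Performance-equivalent band: everything at or above the AUC floor (about 0.815).
floor = CHAMPION_AUC - TOLERANCE
equivalent = [c for c in candidates if c["auc"] >= floor]

# Inside the band, pick the candidate with the least disparity.
selected = max(equivalent, key=lambda c: c["impact_ratio"])
auc_cost = round(CHAMPION_AUC - selected["auc"], 3)
print(selected["name"], auc_cost)  # prints: monotonic logistic 0.004
```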

Make the search defensible

The search only protects you if its rules are set before you see results.

Declare the accuracy tolerance in advance. Decide what counts as performance-equivalent, in a named metric tied to the business outcome (approval volume, expected loss, default rate), before sweeping. A tolerance chosen after seeing candidates is a tolerance reverse-engineered to keep the model you wanted.
Fix the fairness metric to the harm. Use the same disparity measure your disparate-impact test uses, on the same protected classes. A fairer model on a metric nobody challenges is not an LDA.
Search a real range, and log the misses. A sweep of three near-identical configurations is not a search. Record every candidate, including the ones you rejected and why, so the negative result is itself evidence if no LDA exists.
Re-run on a schedule. Populations and data drift. A model with no available LDA at launch can acquire one as the world changes, which is why the CFPB frames LDA testing as a recurring obligation rather than a launch gate.
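These rules can be enforced in the search tooling itself. The sketch below is one possible shape, with hypothetical class and field names: the protocol is frozen before any candidate is scored, and every candidate is logged with a reason if it falls outside the band.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass(frozen=True)
class SearchProtocol:
    """Rules fixed before any candidate is scored; frozen so they cannot drift."""
    accuracy_metric: str
    tolerance: float
    fairness_metric: str
    champion_accuracy: float


@dataclass
class LogEntry:
    candidate_id: str
    accuracy: float
    disparity_score: float
    in_band: bool
    rejection_reason: Optional[str]


def log_candidate(protocol: SearchProtocol, log: List[LogEntry],
                  candidate_id: str, accuracy: float, disparity_score: float) -> LogEntry:
    """Record a candidate, whether or not it clears the pre-declared tolerance."""
    in_band = accuracy >= protocol.champion_accuracy - protocol.tolerance
    reason = None if in_band else f"{protocol.accuracy_metric} below pre-declared tolerance"
    entry = LogEntry(candidate_id, accuracy, disparity_score, in_band, reason)
    log.append(entry)
    return entry


def dump_log(protocol: SearchProtocol, log: List[LogEntry]) -> str:
    """One JSON line for the protocol, then one per candidate, misses included."""
    lines = [json.dumps({"protocol": asdict(protocol)})]
    lines += [json.dumps(asdict(e)) for e in log]
    return "\n".join(lines)
```

Writing the protocol as the first line of the log ties every later entry to the rules in force when it was scored, which is useful when the search is re-run on a schedule.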

The takeaway

Less-discriminatory-alternative search turns a legal exposure into an engineering routine. Define your accuracy tolerance, enumerate the Rashomon set of equally good models both within and across families, pick the lowest-disparity point inside it, and keep the log. Either you ship a fairer model that performs, or you produce documented proof that no such model exists. Both outcomes are stronger than the one most programs settle for, which is never having looked.
