Proxy Variables and Disparate-Impact Testing · The Data Substrate for AI in Finance

Your model never sees race, sex, age, or national origin. You stripped those fields out of the training data and you can prove it. The problem is that a dozen other features carry the same signal anyway, and your model will learn it whether you meant it to or not.

This is the core failure mode of fair lending in machine learning. A proxy variable is a feature that correlates strongly enough with a protected class that the model can reconstruct membership from it. ZIP code is the textbook case. It is also one of hundreds of fields a modern underwriting or fraud model ingests without anyone tracing where the correlation lives.

The previous lesson treated lineage and provenance as infrastructure. This lesson is about what you do once you can see the data clearly: find the features that quietly stand in for a protected class, and measure the gap they produce in outcomes.

What counts as a proxy

A proxy is not a feature you chose for its protected-class content. It is a feature you chose for a legitimate reason that happens to be predictive of protected-class membership as a side effect.

Geography is the most common. A ZIP code or census block group stratifies the population by income, race, and ethnicity because American neighborhoods are segregated. The model does not know it is learning race. It learns that applicants from certain blocks default more often, and the block correlates with race, so the score moves with race.

Other proxies are less obvious. Surname carries ethnicity. First name carries both sex and ethnicity. Thin-file or no-file credit status correlates with age and with recent immigration. Account-behavior features, the type of device, the time of day someone applies, the merchant categories in their transaction history, all leak demographic signal in ways that are hard to predict in advance.

The danger is not the single proxy you can name. It is the interaction of several weak proxies that together reconstruct the protected attribute the model was never given.

A neural net or gradient-boosted model is very good at finding those interactions. That is the whole point of it. So you cannot reason your way to a clean feature set by inspection. You have to test.
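A toy illustration of why inspection fails, using deliberately extreme synthetic data rather than anything from a real portfolio: two binary features that are each uncorrelated with group membership but determine it exactly when combined. The names feature_1, feature_2, and group are hypothetical. In practice you would more likely fit a classifier that predicts the protected attribute (or its estimate) from the full feature set and check held-out accuracy.

```python
from collections import Counter, defaultdict
from itertools import product


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def majority_accuracy(keys, labels):
    """In-sample accuracy of guessing the most common label for each key."""
    cells = defaultdict(Counter)
    for key, label in zip(keys, labels):
        cells[key][label] += 1
    return sum(max(c.values()) for c in cells.values()) / len(labels)


# Synthetic data (an assumption for illustration): group = feature_1 XOR feature_2.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 25
feature_1 = [r[0] for r in rows]
feature_2 = [r[1] for r in rows]
group = [r[2] for r in rows]

corr_1 = pearson(feature_1, group)   # each feature alone looks clean
corr_2 = pearson(feature_2, group)
acc_1 = majority_accuracy(feature_1, group)
acc_joint = majority_accuracy(list(zip(feature_1, feature_2)), group)

print(f"corr_1={corr_1:.2f} corr_2={corr_2:.2f} "
      f"acc_alone={acc_1:.2f} acc_joint={acc_joint:.2f}")
```

A one-feature-at-a-time correlation screen would pass both features here, while any model that can learn the interaction recovers group membership. Real proxies are rarely this clean, but the mechanism is the same.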

Disparate impact, and the standard test

Disparate impact arises when a facially neutral practice produces a worse outcome for a protected group, regardless of intent. The classic screening heuristic comes from the EEOC's 1978 Uniform Guidelines on Employee Selection Procedures: the four-fifths rule, also called the 80 percent rule.

The rule compares selection rates across groups. If the approval rate for a protected group is less than 80 percent of the rate for the most-favored group, that is a flag worth investigating. It is a screen, not a verdict. A result under 80 percent indicates potential adverse impact, not proof of discrimination, and the threshold has well-documented statistical weaknesses on small samples.
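One common complement to the ratio, sketched here as an assumption and not as anything the rule itself prescribes, is a two-proportion z-test on the same counts. The function name and the counts are hypothetical. The pooled normal approximation is itself shaky on very small samples, where an exact test would be the safer choice.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(approved_1, total_1, approved_2, total_2):
    """Two-sided p-value for the difference between two approval rates,
    using the pooled normal approximation."""
    rate_1 = approved_1 / total_1
    rate_2 = approved_2 / total_2
    pooled = (approved_1 + approved_2) / (total_1 + total_2)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_1 + 1 / total_2))
    z = (rate_1 - rate_2) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same approval rates (60 percent vs 42 percent), very different sample sizes.
p_small = two_proportion_p_value(30, 50, 21, 50)
p_large = two_proportion_p_value(600, 1000, 420, 1000)

print(f"50 per group:   p = {p_small:.3f}")
print(f"1000 per group: p = {p_large:.3g}")
```

Both cases have the same 0.70 ratio and both fall under the four-fifths threshold, but only the larger sample clears a conventional 5 percent significance level. That is the small-sample weakness in concrete terms: the ratio alone cannot tell a stable gap from noise.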

A worked example

Say your model scores 1,000 applicants from Group A and 1,000 from Group B.

Group A: 600 approvals, a 60 percent approval rate.
Group B: 420 approvals, a 42 percent approval rate.

The ratio is 42 / 60 = 0.70, or 70 percent. That falls below the 80 percent threshold, so the model is flagged for potential adverse impact. Now the real work starts, because the four-fifths rule tells you that a gap exists but not why.
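The arithmetic above fits in a few lines. This is a minimal sketch; the function name and the FOUR_FIFTHS constant are illustrative choices, not part of any standard library.

```python
FOUR_FIFTHS = 0.80  # screening threshold from the four-fifths rule


def adverse_impact_ratio(approved_protected, total_protected,
                         approved_reference, total_reference):
    """Protected group's approval rate divided by the reference group's rate."""
    rate_protected = approved_protected / total_protected
    rate_reference = approved_reference / total_reference
    return rate_protected / rate_reference


# Group B (420 of 1,000 approved) against Group A (600 of 1,000 approved).
ratio = adverse_impact_ratio(420, 1000, 600, 1000)
flagged = ratio < FOUR_FIFTHS

print(f"ratio = {ratio:.2f}, flagged = {flagged}")
```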

You re-run the analysis holding the legitimate credit factors constant. If the gap collapses once you control for income, debt-to-income, and payment history, the disparity may be explained by genuine risk. If a meaningful gap survives those controls, you look for the proxy. You drop suspect features one at a time, or you regress the protected-class estimate on each feature, and you watch the disparity move. A feature whose removal closes the gap is your leading proxy candidate, though correlated features can share the signal, so confirm by testing them together.
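The simplest way to hold a factor constant is to stratify on it and compare approval rates within each stratum. The sketch below uses constructed counts, not real data: they reproduce the 60 percent and 42 percent overall rates from the example, split across two hypothetical debt-to-income bands. A production analysis would more likely use a regression with several controls; the band names and helper functions here are assumptions for illustration.

```python
from collections import defaultdict


def stratified_ratios(records, protected="B", reference="A"):
    """records: iterable of (group, stratum, approved) tuples.
    Returns {stratum: protected approval rate / reference approval rate}."""
    counts = defaultdict(lambda: [0, 0])  # (group, stratum) -> [approved, total]
    for group, stratum, approved in records:
        counts[(group, stratum)][0] += int(approved)
        counts[(group, stratum)][1] += 1

    strata = sorted({stratum for _, stratum in counts})
    ratios = {}
    for stratum in strata:
        ref_ok, ref_n = counts[(reference, stratum)]
        pro_ok, pro_n = counts[(protected, stratum)]
        if ref_n == 0 or pro_n == 0 or ref_ok == 0:
            continue  # ratio undefined for this stratum
        ratios[stratum] = (pro_ok / pro_n) / (ref_ok / ref_n)
    return ratios


def make_rows(group, stratum, approved_n, total_n):
    """Build total_n records of which the first approved_n are approvals."""
    return [(group, stratum, i < approved_n) for i in range(total_n)]


# Constructed example: same approval rate within each band for both groups,
# but Group B is concentrated in the high debt-to-income band.
records = (
    make_rows("A", "low_dti", 560, 800) + make_rows("A", "high_dti", 40, 200)
    + make_rows("B", "low_dti", 308, 440) + make_rows("B", "high_dti", 112, 560)
)

overall_a = sum(ok for g, _, ok in records if g == "A") / 1000
overall_b = sum(ok for g, _, ok in records if g == "B") / 1000
raw_ratio = overall_b / overall_a
within = stratified_ratios(records)

print(f"raw ratio = {raw_ratio:.2f}")
for stratum, ratio in within.items():
    print(f"{stratum}: ratio = {ratio:.2f}")
```

In this constructed case the raw ratio is 0.70 while the within-band ratios are 1.0, which is what "the gap collapses" looks like. If the within-band ratios stayed well below 1.0, that would be the signal to start hunting for a proxy.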

You usually do not have the protected attribute

Here is the practical wrinkle. Lenders are generally prohibited from collecting applicant race and ethnicity for non-mortgage credit, so you cannot just group by a field that does not exist in your data.

The industry answer is a proxy for the proxy. The CFPB uses Bayesian Improved Surname Geocoding, or BISG, developed by a RAND team led by statistician Marc Elliott. BISG combines the race and ethnicity distribution of an applicant's surname with the demographics of their census block group into a single probability that the applicant belongs to each group. A first-name variant, BIFSG, adds first-name information for better accuracy.
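At its core BISG is a single Bayes update: the probability of each group given the surname is multiplied by the probability of living in that geography given the group, then renormalized, on the assumption that surname and geography are independent once group is known. The sketch below shows only that update. The group labels and every probability in it are made-up placeholders, not census figures, and the function name is hypothetical.

```python
def bisg_posterior(p_group_given_surname, p_geo_given_group):
    """P(group | surname, geo) is proportional to
    P(group | surname) * P(geo | group), assuming surname and geography
    are conditionally independent given group."""
    unnormalized = {
        group: p_group_given_surname[group] * p_geo_given_group[group]
        for group in p_group_given_surname
    }
    total = sum(unnormalized.values())
    return {group: value / total for group, value in unnormalized.items()}


# Placeholder inputs (illustrative only, not real surname or census tables).
surname_prior = {"group_1": 0.60, "group_2": 0.30, "group_3": 0.10}
# Share of each group's total population living in this block group.
geo_likelihood = {"group_1": 0.001, "group_2": 0.004, "group_3": 0.001}

posterior = bisg_posterior(surname_prior, geo_likelihood)
for group, prob in posterior.items():
    print(f"{group}: {prob:.3f}")
```

The surname alone points to group_1, but the geography shifts most of the probability to group_2. The output is a probability vector per applicant, not a label, and downstream analysis is usually better off keeping it that way.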

Use it with care. BISG was designed to measure disparities across large groups, not to guess any individual's race, and it is meaningfully less accurate for Black and Hispanic applicants and in racially diverse neighborhoods. Recent research shows that running fairness audits on a noisy race proxy can distort the very disparity estimate you are trying to measure. The estimate is a starting point for investigation, not a number to report as ground truth.

The legal ground just moved

If you operate in the United States, the rules around this changed in 2026. The CFPB finalized amendments to Regulation B (the rule implementing the Equal Credit Opportunity Act), effective July 21, 2026, that remove all "effects test" references and take the position that ECOA does not authorize disparate-impact liability.

Under the amended rule, a facially neutral feature is actionable under ECOA only where it functions as a proxy that is intentionally designed or applied to advantage or disadvantage a protected group. Statistical disparity alone no longer establishes an ECOA violation at the federal level.

Do not read that as permission to stop testing. The Fair Housing Act still imposes disparate-impact liability on residential mortgage lending, and several state fair lending laws keep the effects test alive. The continued reach of the FHA substantially narrows the practical effect of the ECOA change in the mortgage market. Reputational and model-risk exposure does not track the CFPB's enforcement posture either. The testing you build is the only way you will ever know what your model is actually doing to whom.

What to build into the pipeline

Treat proxy detection and disparate-impact testing as a standing gate, not a one-time audit before launch.

Run the four-fifths screen and a controlled disparity analysis on every candidate model and on the production model at a fixed cadence, using BISG or BIFSG proxies where the protected attribute is unavailable.
Maintain a documented list of features you flagged as proxies and the decision you made on each, so a reviewer can trace why a feature stayed in.
When a feature drives a disparity that legitimate risk factors do not explain, that is the trigger for the next module: searching for a less-discriminatory alternative before you ship.
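A rough sketch of the screening step as a repeatable gate, assuming each applicant arrives with an approval decision and a vector of group probabilities such as a BISG output. Approval rates are estimated by weighting each applicant by their probability of belonging to each group, which is one common way to use probabilistic proxies and inherits their noise. All names and the four-row dataset are hypothetical, and a sample this small would never support a real conclusion.

```python
from collections import defaultdict


def weighted_approval_rates(applicants):
    """applicants: list of (approved, {group: probability}) pairs.
    Returns each group's probability-weighted approval rate."""
    weighted_approvals = defaultdict(float)
    weighted_totals = defaultdict(float)
    for approved, probs in applicants:
        for group, prob in probs.items():
            weighted_totals[group] += prob
            if approved:
                weighted_approvals[group] += prob
    return {
        group: weighted_approvals[group] / total
        for group, total in weighted_totals.items()
        if total > 0
    }


def four_fifths_gate(rates, threshold=0.80):
    """Return {group: ratio} for groups whose rate falls below the threshold
    relative to the most-favored group. An empty dict means the screen passed."""
    best = max(rates.values())
    if best == 0:
        return {}
    return {
        group: rate / best
        for group, rate in rates.items()
        if rate / best < threshold
    }


# Toy input: (approved, group probabilities) per applicant.
applicants = [
    (True, {"group_1": 0.9, "group_2": 0.1}),
    (True, {"group_1": 0.8, "group_2": 0.2}),
    (False, {"group_1": 0.2, "group_2": 0.8}),
    (True, {"group_1": 0.1, "group_2": 0.9}),
]

rates = weighted_approval_rates(applicants)
flags = four_fifths_gate(rates)

print("rates:", {g: round(r, 3) for g, r in rates.items()})
print("flagged:", {g: round(r, 3) for g, r in flags.items()})
```

Returning the flagged groups with their ratios, instead of a bare pass or fail, gives the reviewer something to record in the documented feature list and a starting point for the controlled analysis.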

The takeaway is plain. A clean feature list is not the absence of protected attributes. It is the demonstrated absence of features that reconstruct them, and you only earn that demonstration by testing for it on every model, every time.
