Consent and Purpose Limitation: The Input You Are Not Allowed to Use · The Data Substrate for AI in Finance

A model trained on data you were not allowed to use for that purpose is not a fairness problem you can patch later. It is a defect baked into the substrate, and no amount of disparate-impact testing downstream will clean it. The earlier modules in this course treat lineage and provenance as the infrastructure that tells you where a feature came from. This one is about a narrower question that sits on top of lineage: even if you can prove where a column came from, were you permitted to use it for this?

That permission question has two parts. Consent is whether the consumer agreed to the collection and use. Purpose limitation is whether the use you are now putting the data to is compatible with the purpose it was collected under. They are related but not the same, and in finance they are governed by different rules with different teeth.

Purpose limitation is a property of the data, not the model

The instinct in most ML teams is to treat permitted-use as a legal review that happens once, near launch. By then the data is already in the feature store, already joined, already in three notebooks. Pulling a column at that stage means re-running everything that depended on it.

Treat purpose limitation as a field-level attribute that travels with the data from ingestion. Every column carries a collection purpose and a set of permitted uses. A training pipeline checks those attributes the way it checks types. If a feature's permitted-use set does not include "credit underwriting model" and the model is an underwriting model, the pipeline fails the build. This is the same discipline as point-in-time correctness covered later in the feature-store module: a constraint enforced at read time, not argued about in a meeting.

The rules that actually bind in US finance

Three regimes do most of the work, and they do not overlap cleanly.

The Gramm-Leach-Bliley Act, through Regulation P, governs nonpublic personal information held by financial institutions. Its core mechanic is notice and opt-out for sharing with nonaffiliated third parties. The part that bites model builders is the reuse and redisclosure limit. For most financial institutions this now lives in the CFPB's Regulation P at 12 CFR 1016.11, which took over GLBA privacy rulemaking after Dodd-Frank; the FTC's parallel rule at 16 CFR 313.11 carries the identical limit but applies only to the narrower set of entities still under FTC authority. Under either, when you receive nonpublic personal information from another financial institution under one of the statutory exceptions, you may only use and further disclose it to carry out the activity the exception covered. You do not inherit a free hand to repurpose it.

The Fair Credit Reporting Act governs consumer report data and turns on permissible purpose. Marketing is generally not a permissible purpose for obtaining a consumer report, and the CFPB has moved to make explicit that the "legitimate business need" purpose does not authorize furnishing reports for solicitation. If credit-report-derived attributes feed a model whose output is used for something outside the permissible purpose you obtained them under, the violation attaches to the use, not to the model's accuracy.

The GDPR, where you touch EU data subjects, states the principle most plainly. Article 5(1)(b) requires data be collected for specified, explicit, and legitimate purposes and not further processed in a way incompatible with those purposes. Article 6(4) gives a compatibility test: the link between the original and new purpose, the context of collection, the nature of the data, the possible consequences, and the safeguards in place. Vague or after-the-fact purpose statements have failed that test in recent enforcement. "We always intended to use it for AI" is not a purpose; it is the absence of one.

A worked example: the fraud-to-underwriting leak

Here is the pattern we see most often, stripped to its mechanics.

A bank collects transaction-level data for fraud detection. Consumers were told, accurately, that their transaction patterns are monitored to protect their accounts. The fraud team builds excellent features: merchant-category velocity, device fingerprints, geolocation drift. These features are predictive of more than fraud. They correlate with income stability and spending discipline, which makes them tempting inputs to a credit underwriting model.

A data scientist on the lending side discovers the fraud feature table in the warehouse and joins it in. Validation looks great. The model lifts approval-rate precision by a measurable margin. Nobody lied, nobody stole anything, and the feature store happily served the join.

The defect is that fraud-monitoring consent does not extend to credit decisioning. Under GDPR's compatibility test the consequences for the data subject are materially different: a fraud flag protects you, a credit feature can deny you a loan. Under GLBA and FCRA the analysis is about whether the data's permitted-use set ever included underwriting. In this case it did not, and the model is now built on an input it was never allowed to consume.

The cost of catching this at launch is not just the retraining. It is the adverse-action problem in a later module: if the feature has to come out, every reason code that referenced it has to be regenerated, and any decision already made on the model is exposed.

How to enforce it before training, not after

The fix is structural and cheap if you build it early.

Tag every source field with a collection purpose and an explicit permitted-use list at ingestion, sourced from the consent record or the data-sharing agreement, not from a data scientist's assumption. Make the training job declare its own purpose as a required parameter. Then gate the join: the pipeline computes the intersection of the model's declared purpose against every feature's permitted-use set, and refuses to materialize any feature that fails. Log the check as part of lineage so an auditor can reconstruct, for any model version, exactly which permission each column relied on.

For consent specifically, store the consent as a versioned record with a timestamp, not a boolean. Consent given for fraud monitoring in 2023 is not consent for an underwriting model spun up in 2026, and "consent: true" on a row tells you nothing about scope. The point-in-time discipline from the feature-store module applies here too: you need to know what the consumer agreed to at the moment the data was collected, and whether that scope still covers the use.

The closing takeaway

Purpose limitation fails quietly because the data is real, the model is accurate, and nobody acted in bad faith. The leak happens at the join, where a feature collected for one permitted purpose gets pulled into a model serving another. You cannot test your way out of it downstream, because the problem is not in the predictions; it is in the permission.

Make permitted-use a field-level attribute that travels with every column, make the training pipeline declare and check its own purpose, and version your consent records so scope is auditable to the moment of collection. Do that, and the input you are not allowed to use never reaches the model in the first place, which is the only place the control actually holds.

← Previous

Lineage and Provenance as Infrastructure

Proxy Variables and Disparate-Impact Testing