Lineage and Provenance as Infrastructure · The Data Substrate for AI in Finance

In the last lesson we argued that data, not the model, is where an audit starts. The natural next question is operational: when an examiner or a plaintiff asks how a specific column reached a specific decision, can you answer it in minutes, or do you convene a working group?

If the answer is a working group, you have lineage as documentation. Documentation goes stale the moment the pipeline changes, and it tends to describe the system someone intended to build rather than the one that ran last Tuesday. The alternative is lineage as infrastructure: a layer that emits records automatically as jobs execute, so the map is generated by the territory instead of drawn alongside it.

Documentation drift is the failure mode

A lineage diagram in a wiki or a model risk template is a snapshot. The pipeline keeps moving. Someone swaps a join, repoints a source table, adds a backfill, deprecates a feature. Each change is small and reasonable, and none of them update the diagram.

By the time you need the lineage, for an adverse-action explanation or a fair-lending review, the gap between the documented flow and the executed flow is exactly the part you cannot account for. That gap is also where regulators look hardest.

This is not a hypothetical standard. US banking guidance has long expected traceability of model inputs. SR 11-7, the model risk guidance the Federal Reserve and OCC issued in 2011, called for controls ensuring input data is accurate, complete, and traceable from source through to model output. On April 17, 2026, the Federal Reserve, OCC, and FDIC superseded SR 11-7 with SR 26-2, a more risk-based and principles-driven framework, but the expectation that you can trace and control model inputs did not go away. It became something you are expected to demonstrate, in proportion to the model's risk, rather than merely assert.

What "infrastructure" means here

Treating lineage as infrastructure means three concrete commitments.

Emit at runtime, do not reconstruct

Lineage records should be produced by the jobs themselves as a side effect of running, the same way structured logs are. The orchestrator, the transformation engine, and the query layer each report what they read and wrote. Nobody is asked to remember.

The open standard for this is OpenLineage, a Linux Foundation project that became the common metadata format for lineage collection. It has producer integrations for the tools most finance data teams already run, including Apache Airflow, Apache Spark, dbt, Snowflake, and BigQuery, and consumer integrations such as Marquez and Datadog. The point of adopting a standard is that a Spark job and an Airflow DAG describe themselves in the same vocabulary, so you can stitch a single graph across tools instead of maintaining one diagram per system.
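To make "emit at runtime" concrete, here is a minimal sketch of a job wrapper that produces OpenLineage-shaped run events as a side effect of executing. It uses only the standard library and builds the events as plain dictionaries. The producer URI, the namespaces, the job name, and the `emit` callback are hypothetical, and the schema URL version is an assumption to check against the current spec. In practice the official integrations for Airflow, Spark, or dbt would emit these events for you.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical identifiers for illustration only.
PRODUCER = "https://example.com/lending-pipeline"
# Assumed spec version; verify against the OpenLineage spec you target.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def build_run_event(event_type, run_id, job_name, inputs, outputs):
    """Return a dict shaped like an OpenLineage RunEvent."""
    return {
        "eventType": event_type,  # START, COMPLETE, or FAIL in this sketch
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": run_id},
        "job": {"namespace": "lending", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
    }


def run_with_lineage(job_name, inputs, outputs, work, emit):
    """Run `work` and emit START plus COMPLETE (or FAIL) events around it."""
    run_id = str(uuid.uuid4())
    emit(build_run_event("START", run_id, job_name, inputs, outputs))
    try:
        result = work()
    except Exception:
        emit(build_run_event("FAIL", run_id, job_name, inputs, outputs))
        raise
    emit(build_run_event("COMPLETE", run_id, job_name, inputs, outputs))
    return result


emitted = []  # stand-in for a real transport such as HTTP or Kafka
run_with_lineage(
    "build_credit_decisions",
    inputs=["applications", "bureau_pulls"],
    outputs=["credit_decisions"],
    work=lambda: None,  # the actual transformation would go here
    emit=emitted.append,
)
print(json.dumps(emitted[-1], indent=2))
```

The design point is that the job author supplies only the work. The record of what was read and written falls out of the wrapper, so there is nothing to remember to update.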

Capture column-level detail, not just table-to-table

Table-level lineage tells you that credit_decisions was built from applications and bureau_pulls. That is not enough to defend a decision. You need to know that the risk_tier output column was derived from applicant_income, revolving_utilization, and a transformed zip3 field, and what the transformation was.

OpenLineage carries this through its column-lineage dataset facet, which records which input columns feed which output columns and how. When a specific integration cannot collect that detail, it omits the facet and still reports the table-level dependency, so you degrade gracefully rather than losing the whole record. Column-level lineage is also the input the rest of this course depends on. Proxy-variable testing (module 4) and adverse-action reasons that map to outputs (module 6) both assume you can name the exact columns behind a score.
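As a rough illustration, the output dataset entry below carries a column-lineage facet for risk_tier, and a small helper reads it back. The field names follow the OpenLineage columnLineage facet as best I understand it, with the required `_producer` and `_schemaURL` keys omitted for brevity. The namespace, the assignment of columns to source tables, and the transformation descriptions are assumptions made for the example. Verify the exact shape against the facet version your integration emits.

```python
# Hypothetical output dataset entry with a column-lineage facet.
# Which table each input column lives in is an assumption for illustration.
credit_decisions = {
    "namespace": "warehouse",
    "name": "credit_decisions",
    "facets": {
        "columnLineage": {
            "fields": {
                "risk_tier": {
                    "inputFields": [
                        {
                            "namespace": "warehouse",
                            "name": "applications",
                            "field": "applicant_income",
                            "transformations": [
                                {"type": "DIRECT", "subtype": "TRANSFORMATION",
                                 "description": "hypothetical: bucketed into bands"}
                            ],
                        },
                        {
                            "namespace": "warehouse",
                            "name": "bureau_pulls",
                            "field": "revolving_utilization",
                            "transformations": [
                                {"type": "DIRECT", "subtype": "IDENTITY",
                                 "description": "hypothetical: passed through"}
                            ],
                        },
                        {
                            "namespace": "warehouse",
                            "name": "applications",
                            "field": "zip3",
                            "transformations": [
                                {"type": "DIRECT", "subtype": "TRANSFORMATION",
                                 "description": "hypothetical: derived from postal code"}
                            ],
                        },
                    ]
                }
            }
        }
    },
}


def upstream_columns(dataset, column):
    """Return 'table.column' inputs for an output column.

    Returns None when the facet is absent, meaning only table-level
    lineage is available for this dataset.
    """
    facet = dataset.get("facets", {}).get("columnLineage")
    if facet is None:
        return None
    entry = facet["fields"].get(column, {})
    return [f'{src["name"]}.{src["field"]}' for src in entry.get("inputFields", [])]


print(upstream_columns(credit_decisions, "risk_tier"))
```

Returning None rather than an empty list keeps "no column detail was collected" distinct from "this column has no recorded inputs", which matters when a reviewer needs to know whether the absence is a gap in coverage.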

Make it queryable and time-aware

A lineage graph you cannot query at a point in time is a poster. You need to ask "as of the day this loan was scored, what fed risk_tier," and get the version of the graph that was live then, not today's. That means versioning lineage events and joining them to the run that produced a given decision, which connects directly to feature-store point-in-time correctness in module 8.
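A point-in-time lookup can be sketched in a few lines, assuming lineage is stored as versioned records that each carry the time they took effect. The record layout, the dates, and the function name here are hypothetical. A real lineage store would expose this through its own query API.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    """Build a timezone-aware UTC datetime for a calendar day."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical versioned lineage: (effective_from, output_column, input_columns).
LINEAGE_VERSIONS = [
    (utc(2024, 1, 1), "risk_tier", ["applicant_income", "revolving_utilization"]),
    (utc(2024, 6, 1), "risk_tier", ["applicant_income", "revolving_utilization", "zip3"]),
]


def lineage_as_of(versions, column, as_of):
    """Return the inputs for `column` from the latest version effective at `as_of`.

    Returns None if no version for that column existed yet.
    """
    candidates = [v for v in versions if v[1] == column and v[0] <= as_of]
    if not candidates:
        return None
    latest = max(candidates, key=lambda v: v[0])
    return latest[2]


# A decision scored in March resolves to the January version, not today's.
print(lineage_as_of(LINEAGE_VERSIONS, "risk_tier", utc(2024, 3, 15)))
```

The filter on the effective time is the whole point: the query never sees versions that went live after the decision, so a later pipeline change cannot rewrite the answer.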

A worked example

A small-business lending model declines an applicant on March 3. Six weeks later the applicant disputes the decision and the fair-lending team needs to reconstruct it.

With lineage as documentation, the team opens the model risk file, finds a flow diagram dated last quarter, and discovers that the industry_code enrichment source was repointed to a new vendor feed in mid-February. The diagram still shows the old source. Now they cannot say with confidence which feed produced the value used on March 3, and the investigation stalls on a question that should have taken minutes.

With lineage as infrastructure, the team queries the lineage store for the scoring run tied to that application ID. The run's OpenLineage events name the exact input datasets and versions, the column-level facet shows risk_tier was derived from income, utilization, industry_code, and zip3, and the event timestamp confirms industry_code came from the new vendor feed that was already live on March 3. They can also see that zip3 fed the score, which immediately flags a geography-adjacent input for the proxy-variable review.

The difference is not polish. One path produces an answer and a next action. The other produces a meeting.
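The infrastructure path in this example amounts to two lookups and a filter. The sketch below assumes a decision log that maps application IDs to scoring runs and a lineage store keyed by run. The IDs, the in-memory dictionaries, and the list of geography-adjacent column names are all hypothetical stand-ins for whatever your decision log, lineage store, and proxy-variable policy actually define.

```python
# Hypothetical decision log: application ID -> scoring run ID.
DECISION_RUNS = {"app-1001": "run-7f3a"}

# Hypothetical lineage store: run ID -> output column -> input columns.
RUN_COLUMN_LINEAGE = {
    "run-7f3a": {
        "risk_tier": ["income", "utilization", "industry_code", "zip3"],
    }
}

# Assumed policy list of inputs that warrant proxy-variable review.
GEOGRAPHY_ADJACENT = {"zip3", "zip5", "census_tract"}


def reconstruct_decision(application_id, output_column):
    """Resolve an application to its run, its inputs, and any flagged columns."""
    run_id = DECISION_RUNS[application_id]
    inputs = RUN_COLUMN_LINEAGE[run_id][output_column]
    flagged = [col for col in inputs if col in GEOGRAPHY_ADJACENT]
    return {"run_id": run_id, "inputs": inputs, "proxy_review": flagged}


report = reconstruct_decision("app-1001", "risk_tier")
print(report["proxy_review"])
```

The lookup yields both the answer (which inputs fed the score) and the next action (which inputs go to proxy review), which is the contrast the example draws with the stalled documentation path.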

What to instrument first

You do not need full coverage on day one. Start where decisions are made and risk concentrates.

Instrument the pipelines that feed models touching credit, pricing, or eligibility before you worry about internal reporting marts. Turn on column-level lineage for the tables that hold model features and model outputs. Pick one open format and emit to it from every tool rather than buying a separate tool per system. Store events with versions and timestamps so a past decision resolves to the graph that was actually live then.

Takeaway

Lineage earns the word "infrastructure" only when it runs without anyone maintaining it: emitted at execution, captured to the column, versioned, and queryable at a point in time. Build it that way and every later control in this course, from proxy testing to adverse-action mapping, has the substrate it needs. Build it as documentation and you will rediscover the gap between the diagram and the pipeline at the worst possible moment, in front of someone who is allowed to ask follow-up questions.
