Provenance for RAG: Knowing What the Model Retrieved and From Where · The Data Substrate for AI in Finance

Most teams adopt retrieval-augmented generation (RAG) to stop the model from making things up. The retriever fetches real documents, the model answers from them, and the answer is supposed to be grounded. That framing hides the problem this lesson is about. A RAG system can return the wrong answer for a fully traceable reason, or the right answer for a reason no one can reconstruct later. In a bank, the second case is the dangerous one.

Earlier modules in this course treated lineage and provenance as data infrastructure, and consent and purpose limitation as controls on what you are allowed to use. RAG inherits all of that, then adds a runtime question those modules did not have to answer: for this specific output, on this specific date, what did the model actually see, and does the output follow from it? That is retrieval provenance, and it is a separate discipline from training-data lineage.

Why "grounded" is not the same as "auditable"

Supervisory guidance has been explicit about reconstructability for years. The Federal Reserve's SR 11-7 and the OCC's companion Bulletin 2011-12 (both issued April 2011) say documentation should be detailed enough that "parties unfamiliar with a model can understand how the model operates, its limitations, and its key assumptions." A RAG pipeline where the answer depends on whatever the vector index happened to return that afternoon fails this test unless you capture the retrieval event itself.

Grounding is a property of a good answer. Provenance is a property of the record. You can have a grounded answer with no provenance, which means you cannot defend it in a validation review or an adverse-action dispute. You can also have rich provenance attached to a wrong answer, which is uncomfortable but useful, because at least you can see where it broke.

The three failure modes you are actually logging against

RAG fails in three distinct ways, and a provenance record should let you tell them apart after the fact.

Retrieval miss. The document that contained the answer was never returned, so the model answered from a gap.
Context leak. The model ignored the retrieved text and pulled an answer from its parametric memory, which may contradict the source.
Generation drift. The retrieved chunk was correct, but the model paraphrased it into something that means something different.

Without the retrieved set and the per-claim attribution, all three look identical from the outside: a confident answer that turned out to be wrong. With them, they are three different bugs with three different fixes.
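One way to make that separation concrete is a small triage function that runs over a stored record after a wrong answer is reported. This is a minimal sketch, not a standard algorithm: the `Failure` enum, the `triage` function, and its three inputs are hypothetical names, and the rule order is an assumption about how the three failure modes defined above could be told apart.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    RETRIEVAL_MISS = "retrieval miss"
    CONTEXT_LEAK = "context leak"
    GENERATION_DRIFT = "generation drift"
    NONE = "no failure detected"


def triage(answer_chunk_retrieved: bool,
           attributed_chunk_id: Optional[str],
           claim_supported: bool) -> Failure:
    """Classify one wrong claim using fields from its provenance record.

    answer_chunk_retrieved: a reviewer confirmed the chunk holding the
        correct answer was in the retrieved set (hypothetical input).
    attributed_chunk_id: the chunk the claim was attributed to, or None
        if no retrieved chunk backs it.
    claim_supported: whether the attributed chunk actually entails the claim.
    """
    if not answer_chunk_retrieved:
        # The right source never reached the model.
        return Failure.RETRIEVAL_MISS
    if attributed_chunk_id is None:
        # The source was available, but the claim came from somewhere else.
        return Failure.CONTEXT_LEAK
    if not claim_supported:
        # The claim points at a chunk that does not say what the answer says.
        return Failure.GENERATION_DRIFT
    return Failure.NONE
```

Each branch reads a field that only exists if the retrieved set and the per-claim attribution were stored at generation time, which is the practical argument for capturing both.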

What a provenance record has to contain

Treat each RAG response as an event you can replay. At minimum, capture and store the following alongside the output:

The exact user query and any rewritten or expanded query the retriever actually ran.
The retrieved set: document IDs, chunk IDs, the index or embedding model version, and the similarity scores. Page-level or span-level identifiers matter, because "we cited the 10-K" is not defensible, but "we cited the 10-K, FY2025, page 47, the liquidity coverage paragraph" is.
A content hash or version pin for each retrieved chunk, so you can prove the source text has not changed since the answer was generated.
The generation-time configuration: model version, prompt template version, temperature, and any reranker applied.
Statement-level attribution: which sentence in the answer maps to which retrieved chunk.

That last item is the one teams skip and later regret. A single citation footer at the bottom of an answer tells you the model saw four documents. It does not tell you whether the specific claim that triggered a complaint came from a document or from the model's own fabrication.
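The fields listed above can be sketched as a small set of immutable records. This is an illustrative schema under stated assumptions, not a reference design: every class name, field name, and sample value below is hypothetical, and SHA-256 is one reasonable choice of content hash among several.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


def content_hash(text: str) -> str:
    """SHA-256 of the chunk text, stored as a version pin."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RetrievedChunk:
    document_id: str
    chunk_id: str
    span: str               # page-level or span-level locator
    similarity: float
    content_sha256: str


@dataclass(frozen=True)
class ClaimAttribution:
    sentence: str
    chunk_id: Optional[str]  # None: no retrieved chunk backs this sentence


@dataclass(frozen=True)
class ProvenanceRecord:
    user_query: str
    executed_query: str      # the rewritten query the retriever actually ran
    index_version: str
    embedding_model_version: str
    model_version: str
    prompt_template_version: str
    temperature: float
    reranker: Optional[str]
    retrieved: Tuple[RetrievedChunk, ...]
    attributions: Tuple[ClaimAttribution, ...]
    answer: str

    def to_json(self) -> str:
        """Serialize with sorted keys so the stored form is stable."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical sample values, for illustration only.
chunk_text = "Placeholder text of the retrieved policy chunk."
record = ProvenanceRecord(
    user_query="Can this applicant get the promotional rate?",
    executed_query="cash-flow line promotional rate eligibility",
    index_version="2026-04-02",
    embedding_model_version="embedder-v1",
    model_version="generator-v1",
    prompt_template_version="template-v3",
    temperature=0.0,
    reranker=None,
    retrieved=(RetrievedChunk("policy-cashflow-line", "12", "section 4",
                              0.83, content_hash(chunk_text)),),
    attributions=(ClaimAttribution("Yes, the promotional rate applies.", "12"),
                  ClaimAttribution("Approval is automatic.", None)),
    answer="Yes, the promotional rate applies. Approval is automatic.",
)
stored = record.to_json()
```

The second attribution has `chunk_id=None`, which is exactly the case a citation footer hides: a sentence in the answer that no retrieved chunk backs.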

A worked example

A relationship manager asks an internal assistant, "Can this small-business applicant be offered the cash-flow line at the promotional rate?" The assistant answers yes and cites the current product policy document.

Six weeks later, a reviewer flags that the promotional rate had been withdrawn for that product tier before the query ran. With a thin RAG setup, all you have is the answer text and a document name. You cannot tell whether the retriever pulled a stale cached version, whether the live policy was retrieved but the model paraphrased an exception clause into a blanket yes, or whether the model never read the retrieved text at all.

With a full provenance record, you open the event and see: the retriever returned policy-cashflow-line chunk 12, content hash a91f..., similarity 0.83, index version 2026-04-02. The hash resolves to a snapshot dated before the rate was withdrawn. The failure is now obvious, and it is a retrieval miss: the index was stale, so the current policy was never returned. The answer was faithful to a document version that should no longer have been served, and the fix is a freshness control on ingestion, not a prompt change. That is a 20-minute diagnosis instead of a multi-day argument.
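The diagnosis in this example comes down to two comparisons: does the recorded hash still match the live source, and does the snapshot predate the policy change? A minimal sketch of that check follows, assuming SHA-256 content hashes. The function names, the sample texts, and the dates are all hypothetical, since the lesson does not give the actual withdrawal date or the full hash.

```python
import hashlib
from datetime import date


def sha256_text(text: str) -> str:
    """Hash chunk text the same way it was hashed at generation time."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diagnose_chunk(recorded_hash: str, live_text: str,
                   snapshot_date: date, policy_changed_on: date) -> dict:
    """Compare a stored chunk pin against the live source and a change date."""
    return {
        # False means the source text changed after the answer was generated,
        # or the index served an older version than the live one.
        "matches_live_source": recorded_hash == sha256_text(live_text),
        # True means the retrieved snapshot was taken before the change.
        "snapshot_predates_change": snapshot_date < policy_changed_on,
    }


# Hypothetical texts and dates, for illustration only.
old_text = "Promotional rate available for this product tier."
live_text = "Promotional rate withdrawn for this product tier."
result = diagnose_chunk(
    recorded_hash=sha256_text(old_text),
    live_text=live_text,
    snapshot_date=date(2026, 3, 1),
    policy_changed_on=date(2026, 3, 20),
)
```

With these sample inputs the recorded hash does not match the live text and the snapshot predates the change, which is the stale-index signature described above.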

Verifying the answer against its own sources

Capturing the retrieved set is necessary but not sufficient, because it does not prove the answer follows from those sources. This is where faithfulness checks come in. A faithfulness check decomposes the generated answer into atomic claims and tests each one against the retrieved context, classifying it as supported, neutral, or contradicted, often using a natural-language-inference model. The faithfulness score is the proportion of claims that are supported.

Run this as a gate, not a dashboard. If a claim in the answer is not entailed by any retrieved chunk, you have caught a context leak or generation drift at runtime, before the answer reaches a customer or a credit decision. The frameworks that implement these checks (RAGAS, DeepEval, TruLens, and similar) are mature enough to wire into a pipeline. The harder engineering is storing the per-claim verdict next to the answer so it becomes part of the permanent record, not a transient test result.
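The score and the gate are simple once the per-claim verdicts exist. The sketch below assumes the verdicts have already been produced by some upstream checker (an NLI model or one of the frameworks named above) and deliberately does not call any of those libraries. The function names and the default threshold of 1.0 are assumptions for illustration, not recommended values.

```python
from typing import Sequence

ALLOWED_VERDICTS = {"supported", "neutral", "contradicted"}


def faithfulness_score(verdicts: Sequence[str]) -> float:
    """Proportion of atomic claims whose verdict is 'supported'."""
    if not verdicts:
        raise ValueError("no claims to score")
    unknown = set(verdicts) - ALLOWED_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    supported = sum(1 for v in verdicts if v == "supported")
    return supported / len(verdicts)


def passes_gate(verdicts: Sequence[str], threshold: float = 1.0) -> bool:
    """Release the answer only if the score meets the threshold.

    A threshold of 1.0 (hypothetical default) blocks any answer that
    contains even one unsupported claim.
    """
    return faithfulness_score(verdicts) >= threshold


# Four atomic claims, two of them supported by the retrieved context.
verdicts = ["supported", "supported", "neutral", "contradicted"]
score = faithfulness_score(verdicts)   # 2 supported out of 4 claims
released = passes_gate(verdicts)
```

Storing `verdicts` itself, not only `score`, is what lets a later reviewer see which specific claim failed.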

Treat abstention as a first-class output

A provenance-serious RAG system is allowed to say it does not know. If the retrieved set does not support an answer, returning "insufficient evidence" with the (empty or weak) retrieval record attached is a better outcome than a confident, ungrounded answer. Abstention is cheaper to defend than a wrong adverse-action reason, a point the next module on adverse-action mapping builds on.
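Abstention can be modeled as an ordinary response type, not an error path. The sketch below is one hedged way to do it: the `Response` class, the `answer_or_abstain` function, and the 0.75 similarity cutoff are all hypothetical, and a real system would likely combine similarity with the faithfulness verdicts instead of relying on similarity alone.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

ScoredChunk = Tuple[str, float]  # (chunk_id, similarity)


@dataclass(frozen=True)
class Response:
    status: str                          # "answered" or "insufficient_evidence"
    answer: Optional[str]
    retrieved: Tuple[ScoredChunk, ...]   # kept even when the system abstains


def answer_or_abstain(draft_answer: str,
                      retrieved: Sequence[ScoredChunk],
                      min_similarity: float = 0.75) -> Response:
    """Return the draft only if at least one chunk clears the cutoff."""
    has_evidence = any(score >= min_similarity for _, score in retrieved)
    if not has_evidence:
        # Abstain, but still attach the weak or empty retrieval record.
        return Response("insufficient_evidence", None, tuple(retrieved))
    return Response("answered", draft_answer, tuple(retrieved))


weak = answer_or_abstain("Yes.", [("12", 0.41), ("7", 0.38)])
strong = answer_or_abstain("Yes.", [("12", 0.83)])
empty = answer_or_abstain("Yes.", [])
```

The abstaining response keeps the retrieved set attached, so the record shows what the system looked at before declining to answer.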

Closing takeaway

RAG does not make AI in finance auditable on its own. It makes auditability possible, but only if you treat each answer as a replayable event: the query that ran, the chunks that came back with versions and hashes, the model and prompt configuration, and a claim-by-claim record of what the sources actually support. Build that record at generation time. You cannot reconstruct a retrieval event after the index has moved on, and "the model was usually right" is not a control a validator will accept.
