Continuous Monitoring: Drift, Hallucination Rate, and Stability Thresholds · Model Risk Management for LLMs

SR 11-7 has always treated ongoing monitoring as a core pillar of validation, alongside conceptual soundness and outcomes analysis. The guidance frames monitoring as the work of confirming that a model is implemented correctly, used as intended, and still appropriate as products, clients, and market conditions move. For a credit scorecard you own end to end, that work is well understood. For a hosted LLM behind an API, the same obligation lands on a system you cannot retrain, cannot freeze, and often cannot version-pin for more than a few months.

That is the gap this module closes. Validation (module 5) tells you the model was fit for purpose on the day you signed off. Monitoring tells you whether that is still true today, and it has to do so without the labeled outcomes and stable feature space that classical monitoring assumes.

What you are actually monitoring

SR 11-7 places process verification and benchmarking under ongoing monitoring and treats outcomes analysis as a companion element of validation. All three activities survive the move to LLMs, but the signals change.

Process verification on a scorecard checks that inputs are accurate and complete. On an LLM it checks that the prompt template, retrieval context, model version, and tool-call surface are what you validated. A silent provider model update or a changed system prompt is the LLM equivalent of a corrupted data feed, and most teams have no alarm for it.
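One way to build that alarm is to fingerprint the validated configuration and compare it on every monitoring run. The sketch below is illustrative only: the function name, the field names, and the example template, model version, and tool names are all assumptions, not a prescribed schema.

```python
import hashlib
import json


def config_fingerprint(prompt_template, model_version, retrieval_config, tool_names):
    """Hash everything that was in scope at validation into one digest."""
    payload = {
        "prompt_template": prompt_template,
        "model_version": model_version,
        "retrieval_config": retrieval_config,
        "tool_names": sorted(tool_names),  # tool order should not change the hash
    }
    # sort_keys gives a deterministic JSON encoding, so equal configs hash equally
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Hypothetical values, recorded once at validation sign-off.
VALIDATED = config_fingerprint(
    prompt_template="You are a disputes assistant. Context: {context}",
    model_version="example-model-2025-03",
    retrieval_config={"index": "policies-v1", "top_k": 5},
    tool_names=["lookup_account", "draft_letter"],
)

# Recomputed from the live deployment on every monitoring run.
live = config_fingerprint(
    prompt_template="You are a disputes assistant. Context: {context}",
    model_version="example-model-2025-06",  # provider-reported version moved
    retrieval_config={"index": "policies-v1", "top_k": 5},
    tool_names=["draft_letter", "lookup_account"],
)

if live != VALIDATED:
    print("ALERT: live configuration differs from the validated configuration")
```

A fingerprint like this only catches changes that are visible in your own configuration or in the version string the provider reports. A provider update that keeps the same version label has to be caught behaviorally.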

Outcomes analysis is harder because you rarely get a clean label. You are not comparing a forecast to a realized default. You are asking whether an answer was faithful to its source, whether a summary was accurate, whether an extraction matched the document. That forces a shift from accuracy metrics to quality metrics: faithfulness, groundedness, and a measured hallucination rate.

Three signals, three thresholds

We monitor three families of signal, each with its own threshold logic.

Drift

Classical population stability index (PSI) does not apply cleanly because an LLM pipeline has no fixed tabular feature space. The adapted version tracks distribution shift on what the pipeline actually consumes and produces.

The standard PSI bands still anchor the math: a PSI under 0.10 is treated as no material shift, 0.10 to 0.25 as moderate shift that warrants attention, and above 0.25 as significant shift. We apply those bands to bucketed signals rather than raw features: input length and topic mix, retrieval-result distributions, output length, and tool-call frequency.
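As a minimal sketch of the bucketed calculation, the code below computes PSI as the sum over buckets of (live share minus baseline share) times the natural log of their ratio, then maps the result onto the bands above. The bucket edges, the floor used for empty buckets, and the sample output lengths are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bucket_shares(values, edges):
    """Share of values in each bucket; edges are ascending upper bounds."""
    if not values:
        raise ValueError("need at least one value")
    counts = Counter()
    for v in values:
        # bucket index = number of edges the value has reached or passed
        counts[sum(v >= e for e in edges)] += 1
    return [counts[i] / len(values) for i in range(len(edges) + 1)]


def psi(expected, actual, floor=1e-4):
    """Population stability index between two lists of bucket shares."""
    total = 0.0
    for e, a in zip(expected, actual):
        # floor empty buckets so the log term stays finite (an assumption)
        e, a = max(e, floor), max(a, floor)
        total += (a - e) * math.log(a / e)
    return total


def psi_band(value):
    """Map a PSI value onto the three bands used in the text."""
    if value < 0.10:
        return "no material shift"
    if value <= 0.25:
        return "moderate shift"
    return "significant shift"


# Hypothetical output lengths in tokens, bucketed at 100 and 200.
edges = [100, 200]
baseline = bucket_shares([80, 120, 150, 220, 90, 130, 300, 110], edges)
live = bucket_shares([250, 310, 120, 280, 400, 90, 260, 330], edges)

score = psi(baseline, live)
print(f"PSI = {score:.3f}: {psi_band(score)}")
```

The same two functions work for any of the bucketed signals listed above, as long as the bucket edges are fixed at validation and not recomputed per window.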

For free text where bucketing is awkward, embedding drift is the more honest measure. Compute a baseline centroid of prompt embeddings from your validation window, then track cosine distance between that baseline and each monitoring window's centroid. A rising distance means the live traffic no longer looks like what you validated on, which is exactly the SR 11-7 trigger for asking whether the model is being used outside its tested scope.
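The centroid comparison can be sketched in a few lines. This version assumes the embeddings have already been computed by whatever embedding model you use and arrive as plain lists of floats; the two-dimensional vectors are toy values standing in for real embeddings.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for same direction, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy embeddings: the validation window and one monitoring window.
validation_window = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
monitoring_window = [[0.6, 0.5], [0.5, 0.6], [0.7, 0.4]]

baseline_centroid = centroid(validation_window)  # computed once, then stored
drift = cosine_distance(baseline_centroid, centroid(monitoring_window))
print(f"embedding drift = {drift:.3f}")
```

One caveat on the design: a centroid summarizes only the mean of the window. Traffic that spreads out or splits into new clusters while keeping the same average will not move this number, so it complements the bucketed signals and does not replace them.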

Hallucination rate

Hallucination rate is the share of outputs that assert something unsupported by the provided context or known ground truth. There is no regulator-mandated number here, so do not invent one or borrow one from a vendor blog. You set the threshold from your validation baseline.

The practical method is a sampled LLM-as-judge or human review on a fixed slice of daily traffic, scoring each output for groundedness, then tracking the rate as a control chart. The threshold is the validated baseline plus a tolerance band, not an absolute target. A model that validated at a 2 percent unsupported-claim rate and drifts to 5 percent has breached its control even though 5 percent might be acceptable for a different use case.
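A rough sketch of the baseline-plus-band check follows. The class and function names are hypothetical, and the judgements are assumed to arrive as booleans from the judge or reviewer. The three-sigma helper is one conventional way to size a band from sample noise (the upper limit of a p-chart), offered here as an assumption and not as the method this module prescribes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HallucinationControl:
    baseline_rate: float  # unsupported-claim rate measured at validation
    tolerance: float      # band added on top of the baseline

    @property
    def threshold(self):
        return self.baseline_rate + self.tolerance

    def breached(self, observed_rate):
        # strictly above the threshold counts as a breach (a choice, not a rule)
        return observed_rate > self.threshold


def unsupported_rate(judgements):
    """judgements[i] is True when output i asserted something unsupported."""
    return sum(judgements) / len(judgements)


def three_sigma_band(baseline_rate, sample_size):
    """Binomial three-sigma band for a rate estimated from sample_size items."""
    return 3 * math.sqrt(baseline_rate * (1 - baseline_rate) / sample_size)


control = HallucinationControl(baseline_rate=0.02, tolerance=0.02)

# Hypothetical daily sample: 5 unsupported outputs out of 100 reviewed.
todays_rate = unsupported_rate([True] * 5 + [False] * 95)
print(f"rate = {todays_rate:.2%}, breached = {control.breached(todays_rate)}")
```

The sample size matters as much as the band. A small daily slice makes the observed rate noisy, so a tolerance narrower than the sampling noise will alert on chance alone.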

Stability

Stability covers behavior that should not change for inputs that did not change. Run a frozen regression set of canonical prompts on a schedule and measure how often the output crosses a meaningful boundary: a different recommended action, a flipped classification, a numeric answer outside tolerance. A spike in flips on unchanged inputs is the clearest early signal that the provider changed the model under you.
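The flip measurement can be sketched as a comparison between stored reference outcomes and the latest run. The `Outcome` fields, the prompt ids, and the numeric tolerance below are illustrative assumptions; in practice the boundary you test is whatever decision the output drives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    action: str    # e.g. the recommended action parsed from the output
    amount: float  # a numeric answer that must stay within tolerance


def flipped(reference, current, amount_tolerance=0.01):
    """True when the current outcome crosses a meaningful boundary."""
    if current is None:  # a missing or unparseable output counts as a flip
        return True
    if reference.action != current.action:
        return True
    return abs(reference.amount - current.amount) > amount_tolerance


def flip_rate(reference_run, current_run, amount_tolerance=0.01):
    """Share of regression prompts whose outcome flipped since the reference."""
    flips = sum(
        flipped(ref, current_run.get(prompt_id), amount_tolerance)
        for prompt_id, ref in reference_run.items()
    )
    return flips / len(reference_run)


# Hypothetical frozen set of four prompts, keyed by prompt id.
reference_run = {
    "p1": Outcome("refund", 25.00),
    "p2": Outcome("deny", 0.00),
    "p3": Outcome("refund", 40.00),
    "p4": Outcome("escalate", 0.00),
}
current_run = {
    "p1": Outcome("refund", 25.00),
    "p2": Outcome("refund", 10.00),  # action changed on an unchanged input
    "p3": Outcome("refund", 40.00),
    "p4": Outcome("escalate", 0.00),
}

print(f"flip rate = {flip_rate(reference_run, current_run):.0%}")
```

Comparing parsed outcomes and not raw text is deliberate: wording varies between runs even on a stable model, and only boundary crossings carry risk.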

Tie the thresholds to materiality

The thresholds and the cadence both scale with the risk tier from module 3. This is the part teams skip, and it is the part examiners notice.

Recurring monitoring at monthly frequency is a reasonable floor, with weekly or daily tracking for higher-risk models. A tier 1 model that can move money or deny a customer earns daily drift checks, daily sampled hallucination scoring, and tight tolerance bands. A tier 3 internal summarizer earns monthly checks and wider bands. Running the same dashboard for both wastes review capacity on the low tier and underserves the high tier.

Materiality also sets what a breach triggers. For a low tier, a moderate-drift alert is a logged note. For a high tier, the same alert should pause the model or route to human review until validation confirms it is still fit for purpose.
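Writing the tiering down as data makes it reviewable. The sketch below encodes only the two tiers described above; the type names are hypothetical, the tier 1 band reuses the 2-point band from the worked example later in this module, and the 30-day cadence and 5-point band for tier 3 are placeholder values chosen to be "monthly" and "wider", not recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class BreachAction(Enum):
    LOG_NOTE = "log a note"
    PAUSE_AND_REVIEW = "pause the model or route to human review"


@dataclass(frozen=True)
class MonitoringPolicy:
    cadence_days: int               # how often drift and hallucination checks run
    hallucination_tolerance: float  # band added to the validated baseline
    moderate_drift_action: BreachAction


# Tier numbers follow the risk tiers from module 3; values are illustrative.
POLICIES = {
    1: MonitoringPolicy(
        cadence_days=1,
        hallucination_tolerance=0.02,
        moderate_drift_action=BreachAction.PAUSE_AND_REVIEW,
    ),
    3: MonitoringPolicy(
        cadence_days=30,
        hallucination_tolerance=0.05,
        moderate_drift_action=BreachAction.LOG_NOTE,
    ),
}


def on_moderate_drift(tier):
    """Look up what a moderate-drift alert triggers for a given tier."""
    return POLICIES[tier].moderate_drift_action


for tier in sorted(POLICIES):
    print(f"tier {tier}: {on_moderate_drift(tier).value}")
```

Tier 2 is left out because the text above does not specify it. A lookup for an undefined tier raises `KeyError`, which is arguably the right failure mode: a model with no assigned policy should not be monitored by default settings.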

A worked example

Take a tier 1 assistant that drafts customer-facing dispute resolutions, validated in March at a 2 percent unsupported-claim rate on a 300-item gold set, with an embedding-drift baseline centroid from March traffic.

We set: embedding cosine drift alert at a 0.15 distance from baseline, bucketed PSI alert on output-length and tool-call distributions at 0.25, sampled hallucination rate alert at 4 percent (baseline plus a 2-point band), and a 200-prompt nightly regression set with an alert if more than 3 percent of outputs flip their recommended action.

In June the nightly regression set shows 9 percent of dispute drafts flipping to a different recommended remedy, with no change to our prompt or retrieval. Embedding drift is flat, so the inputs did not move. Hallucination rate is steady. The isolated stability breach points squarely at a provider-side model change, which process verification confirms against the version metadata. Because this is tier 1, the runbook pauses the model and routes drafts to human review while we re-run the module 5 eval suite against the new version. The same pattern on a tier 3 tool would have been a logged ticket, not a stop.
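The triage in this example can be sketched as a comparison of readings against the four thresholds set above. The thresholds and the 9 percent flip rate come from the example; the other three June readings are made-up stand-ins for "flat" and "steady", and the signal names are hypothetical.

```python
# Alert thresholds from the worked example.
THRESHOLDS = {
    "embedding_cosine_distance": 0.15,
    "bucketed_psi": 0.25,
    "hallucination_rate": 0.04,
    "action_flip_rate": 0.03,
}


def breaches(readings):
    """Names of signals whose reading exceeds its threshold, sorted."""
    return sorted(
        name for name, value in readings.items() if value > THRESHOLDS[name]
    )


def triage(breached):
    """Very rough first reading of a breach pattern; not a full runbook."""
    if not breached:
        return "no action"
    if breached == ["action_flip_rate"]:
        # inputs and groundedness are steady, so suspect the model itself
        return "suspect provider-side model change; check version metadata"
    return "multiple or input-side breaches; investigate traffic and pipeline"


# Only the flip rate is taken from the text; the rest are illustrative values.
june = {
    "embedding_cosine_distance": 0.03,
    "bucketed_psi": 0.04,
    "hallucination_rate": 0.02,
    "action_flip_rate": 0.09,
}

breached = breaches(june)
print(breached, "->", triage(breached))
```

The point of the sketch is the shape of the reasoning: the signals are read together, and the pattern of which ones moved narrows the cause before anyone opens a ticket.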

The takeaway

Monitoring an LLM you cannot open is not about finding one magic metric. It is about reconstructing SR 11-7's process verification, benchmarking, and outcomes analysis from signals that do exist: embedding and bucketed drift, a sampled and baselined hallucination rate, and a frozen stability regression set. Anchor each threshold to the validation baseline, scale the cadence and the response to the materiality tier, and write down what a breach does before one happens. That is what turns a dashboard into a control.
