Risk-Tiering by Materiality: Sizing Oversight to Impact · Model Risk Management for LLMs

Once you have an inventory (the prior module), you face the question that determines how much of your year gets consumed by validation: which models get the full treatment, and which get a lighter touch. Tier everything at the top and your validation team drowns. Tier too low and the first material failure becomes a finding. The job is to put each model where its impact says it belongs, and to be able to show your work.

The revised interagency guidance, SR 26-2, issued jointly by the Federal Reserve, OCC, and FDIC on April 17, 2026, made materiality the operational spine of model risk management rather than a footnote. Where SR 11-7 left tiering as an implied consequence of its proportionality principle, SR 26-2 names the inputs directly. That gives us a cleaner target to design against.

One caveat up front. SR 26-2 formally places generative AI and agentic AI outside its scope, describing those systems as novel and rapidly evolving, and directs institutions to extend their existing model-risk practices to cover them rather than apply the guidance verbatim. LLMs are generative AI, so SR 26-2 does not directly govern LLM tiers. What we do here is apply SR 26-2's materiality principles, exposure and purpose, by analogy to fill that gap, which is exactly the extension the agencies ask for.

What materiality actually means under SR 26-2

SR 26-2 defines materiality through two inputs: model exposure and model purpose.

Exposure is the financial or operational footprint of the decisions the model drives. A model standing behind $10 billion in credit decisions carries more exposure than one that produces an internal propensity score nobody acts on directly.

Purpose is the weight of the decision regardless of dollar size. Models that feed regulatory reporting, capital adequacy, or fair-lending determinations carry high purpose even when their exposure looks modest. The guidance is explicit that a low-exposure tool with a high regulatory purpose, a compliance screening model for example, can still warrant rigorous oversight.

The two inputs combine; they do not average. A high reading on either axis pulls the model up. This matters for LLM use cases because exposure and purpose often diverge: an LLM that drafts internal summaries has low exposure and low purpose, while an LLM that screens loan applications for missing documentation has modest exposure but high purpose the moment its output touches a lending decision.
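The combine-not-average rule can be sketched in a few lines. This is a minimal illustration, not a scale taken from the guidance: the three-point `Level` enum, the `assign_tier` function, and the existence of a middle Tier 2 are all assumptions made for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    """Ordinal reading on one materiality axis (hypothetical three-point scale)."""
    LOW = 1
    MODERATE = 2
    HIGH = 3


def assign_tier(exposure: Level, purpose: Level) -> int:
    """Return a tier from 1 (most oversight) to 3 (least).

    The higher of the two readings governs; nothing is averaged.
    """
    governing = max(exposure, purpose)
    # Assumed mapping: HIGH -> Tier 1, MODERATE -> Tier 2, LOW -> Tier 3.
    return 4 - int(governing)


# Internal summary drafting: low on both axes.
summary_tier = assign_tier(Level.LOW, Level.LOW)
# Loan-documentation screening: modest exposure, high purpose.
screening_tier = assign_tier(Level.MODERATE, Level.HIGH)
print(summary_tier, screening_tier)
```

Using `max` rather than a mean is the whole point: an averaged score would let a low exposure reading dilute a high purpose reading, which is the outcome the rule is meant to prevent.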

Why LLMs strain the old tiering logic

Traditional tiering assumed a model had one owner, one use, and one output you could trace to a number. LLMs break all three. The same base model can sit behind a customer chatbot, a document classifier, and an internal research tool at once, each with different exposure and purpose.

So we tier the application, not the model weights. The unit of tiering is the deployed use case: this model, this prompt and retrieval setup, this decision it informs, this population it touches. One foundation model can generate five inventory entries across five tiers, and that is correct.
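One way to picture this is an inventory keyed by deployment rather than by model. The sketch below is illustrative only: the `UseCase` fields mirror the four elements named above, while the model identifier, configuration names, and tier values are placeholders rather than real assessments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """One inventory entry: a deployment, not a set of model weights."""
    base_model: str         # shared foundation model (hypothetical identifier)
    prompt_config: str      # prompt and retrieval setup
    decision_informed: str  # the decision this deployment feeds
    population: str         # who the deployment touches
    tier: int               # placeholder tier, for illustration only


inventory = [
    UseCase("foundation-model-a", "chatbot-v3",
            "customer service replies", "retail customers", 2),
    UseCase("foundation-model-a", "doc-classifier-v1",
            "document routing", "operations staff", 2),
    UseCase("foundation-model-a", "research-rag-v2",
            "none (internal research)", "strategy team", 3),
]

# Group tiers by base model to show that one model can span several tiers.
tiers_by_model: dict[str, set[int]] = {}
for entry in inventory:
    tiers_by_model.setdefault(entry.base_model, set()).add(entry.tier)

print(tiers_by_model)
```

The tier lives on the `UseCase`, so swapping the base model or adding a new deployment creates or changes an entry without disturbing the others.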

A worked tiering decision

Take a mid-size bank with a single LLM-backed application: an assistant that reviews mortgage applications and flags ones with potential fair-lending concerns for a human reviewer.

Run it through the two axes.

Exposure. The model does not approve or deny loans. It routes files to human reviewers. Its financial footprint is indirect, so exposure looks moderate rather than severe.

Purpose. The output touches a fair-lending determination, one of the most heavily scrutinized decision types a US bank makes. Purpose is high. Under SR 26-2's combine-not-average rule, the high-purpose reading governs.

Tier result: high (Tier 1). That means annual independent validation, a full evaluation suite, and documented ongoing monitoring. The same application would also land in the EU AI Act's high-risk category, since access to essential services such as credit sits in its Annex III list. Under the Digital Omnibus changes, those Annex III high-risk obligations now apply from December 2, 2027, deferred from the earlier August 2, 2026 date, with the amendments still being formally adopted. When two regimes agree a use case is high-stakes, that is a useful cross-check on your own scoring.

Now change one fact. Suppose the same model instead summarizes published research for the strategy team and informs no customer-facing or regulated decision. Exposure is low and purpose is low. Tier result: low (Tier 3). Identify it, monitor for the conditions under which it could become material, and spend your validation hours elsewhere. SR 26-2 explicitly permits exactly this: deeming a model immaterial, then watching for the point at which its use changes.

The instructive part is that the model is identical in both cases. The tier is a property of the deployment, not the technology.

Making the tiering itself auditable

Here is where most programs lose the examiner. The tier assignment is a judgment, and an unrecorded judgment is indistinguishable from a guess. SR 26-2 leans harder on documented judgment than its predecessor, which means the reasoning behind a tier is now part of what gets reviewed, not just the tier label.

Build the tiering as a record, not a result. For every model in the inventory, capture:

The two input readings. Exposure and purpose, each on a defined scale, with one sentence of justification apiece. "Purpose: high. Output informs a fair-lending referral."
The combination rule applied. State that the higher axis governs, so nobody later reads a Tier 1 model and wonders why a moderate-exposure tool sits at the top.
Who assigned it and when. A name and a date. Tiering done by an anonymous spreadsheet is a finding waiting to happen.
The re-tier triggers. The specific conditions that would move the model up: a change in use, a population it starts touching, an exposure threshold it crosses. This is the operational form of SR 26-2's instruction to monitor immaterial models for the point they become material.
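The four items above map naturally onto a structured record. What follows is a sketch under stated assumptions, not a prescribed schema: the `TieringRecord` class, its field names, and the validation in `__post_init__` are hypothetical, and the assignee and date in the example are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TieringRecord:
    """A tier assignment captured with its reasoning (hypothetical schema)."""
    use_case: str
    exposure: str                     # "low", "moderate", or "high"
    exposure_rationale: str           # one sentence of justification
    purpose: str
    purpose_rationale: str
    combination_rule: str
    tier: int
    assigned_by: str
    assigned_on: date
    retier_triggers: tuple[str, ...]

    def __post_init__(self) -> None:
        # Refuse to store a judgment that has no recorded reasoning or owner.
        for name in ("exposure_rationale", "purpose_rationale", "assigned_by"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        if not self.retier_triggers:
            raise ValueError("at least one re-tier trigger is required")


record = TieringRecord(
    use_case="Mortgage fair-lending flagging assistant",
    exposure="moderate",
    exposure_rationale="Routes files to human reviewers; does not approve or deny.",
    purpose="high",
    purpose_rationale="Output informs a fair-lending referral.",
    combination_rule="Higher axis governs.",
    tier=1,
    assigned_by="J. Doe (placeholder)",
    assigned_on=date(2026, 6, 1),     # placeholder date
    retier_triggers=(
        "Change in use",
        "New population touched",
        "Exposure threshold crossed",
    ),
)
print(record.tier, record.assigned_by)
```

Making the record immutable and rejecting empty rationale fields at construction time means an assignment without reasoning cannot enter the inventory in the first place.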

Guard against the two failure modes

The first failure is grade inflation: tiering everything high to be safe. It feels prudent, but it is actually a control weakness, because it tells the examiner your scale carries no information. If 90 percent of your inventory is Tier 1, you do not have a tiering system; you have a label.
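A simple distribution check can surface grade inflation before an examiner does. This is a rough sketch: the `tier_shares` and `looks_inflated` helpers are hypothetical, and the 50 percent threshold is an arbitrary assumption for illustration, not a figure from any guidance.

```python
from collections import Counter


def tier_shares(tiers: list[int]) -> dict[int, float]:
    """Return the fraction of the inventory sitting in each tier."""
    total = len(tiers)
    if total == 0:
        return {}
    counts = Counter(tiers)
    return {tier: counts[tier] / total for tier in sorted(counts)}


def looks_inflated(tiers: list[int], max_top_share: float = 0.5) -> bool:
    """Flag an inventory whose Tier 1 share exceeds an assumed threshold."""
    return tier_shares(tiers).get(1, 0.0) > max_top_share


inflated = [1] * 9 + [3]                   # 90 percent Tier 1
spread = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3]    # 20 percent Tier 1
print(looks_inflated(inflated), looks_inflated(spread))
```

A flag here is a prompt to revisit the scoring, not proof of a problem; an institution whose LLM deployments really are concentrated in regulated decisions could legitimately run a top-heavy inventory.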

The second is silent drift. An LLM application is repurposed, its retrieval source changes, or it quietly starts feeding a decision it never touched before, and the tier never moves. Tie re-tiering to your change-management process so any material change to a deployed model forces a tier review before it ships. The re-tier triggers you wrote down are what make this automatic rather than aspirational.
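Tying re-tiering to change management can be as simple as a gate that blocks release until a tier review is recorded. The sketch below assumes a hypothetical `ChangeRequest` shape and a hypothetical set of trigger fields; a real change-management system will have its own field names and workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    """A proposed change to a deployed use case (hypothetical shape)."""
    use_case: str
    changed_fields: frozenset[str]


# Assumed trigger fields, mirroring the drift scenarios described above.
RETIER_TRIGGER_FIELDS = frozenset(
    {"intended_use", "retrieval_source", "decision_informed", "population"}
)


def requires_tier_review(change: ChangeRequest) -> bool:
    """True when the change touches any field that could alter materiality."""
    return bool(change.changed_fields & RETIER_TRIGGER_FIELDS)


def can_ship(change: ChangeRequest, tier_review_completed: bool) -> bool:
    """Block release until a required tier review has been completed."""
    return tier_review_completed or not requires_tier_review(change)


retrieval_swap = ChangeRequest("research assistant", frozenset({"retrieval_source"}))
copy_tweak = ChangeRequest("research assistant", frozenset({"ui_copy"}))

print(can_ship(retrieval_swap, tier_review_completed=False))
print(can_ship(copy_tweak, tier_review_completed=False))
```

The gate does not decide the new tier; it only guarantees that someone looks before the change goes live.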

The takeaway

Tiering is the lever that decides where your validation budget goes, so the lever has to be honest. Score each deployed use case on exposure and purpose, let the higher reading govern, and write down the reasoning, the owner, the date, and the triggers that would move it. Do that and the tier stops being a number you defend under pressure and becomes a record that defends itself. The validation work in the modules ahead only earns its keep if it lands on the right models, and that is what this step buys you.
