Validating a Model You Cannot Open · Model Risk Management for LLMs

You can read the source of a logistic regression. You cannot read the source of a frontier LLM. The weights are proprietary, the training data is undisclosed, and the vendor will not hand over either. Yet the validation standard does not move. SR 11-7 is explicit that third-party and vendor models belong in the same model risk management framework, under the same principles, as internally built ones. Proprietary components do not lower the bar. They change the method.

This is the gap most teams stall on. The instinct is to ask the vendor for the model so the validators can inspect it. That request will be declined, and even if it were granted, no human is going to audit a few hundred billion parameters. The work has to move from inspecting construction to interrogating behavior and evidence. This lesson covers how to do that without the weights.

What conceptual soundness means when you can't see the model

SR 11-7 names three validation pillars: evaluation of conceptual soundness, ongoing monitoring, and outcomes analysis. Conceptual soundness is the assessment of whether the model's theory, design, and assumptions are appropriate for the intended use. For a model you built, that means reviewing the math, the variable selection, and the developmental testing.

For a model you cannot open, conceptual soundness becomes a question about fit, not construction. You are no longer asking "is this code correct." You are asking whether a general-purpose language model is an appropriate instrument for the specific decision you are wiring it into, and whether the vendor has produced enough evidence to support that the design holds up.

Two questions carry most of the weight here.

First, is the model appropriate for the use? A frontier LLM is a strong summarizer and a weak calculator. If you are using it to draft a customer communication, the design fit is plausible. If you are using it to compute a credit decision or a reserve number, the design fit is questionable on its face, and no amount of vendor documentation rescues a misapplied tool. Conceptual soundness fails at the use case before it ever reaches the model.

Second, what developmental evidence has the vendor actually produced? You cannot replicate their training. You can demand the artifacts that stand in for it.

Vendor challenge without the weights

SR 11-7 expects banks to obtain developmental evidence from the vendor and to maintain contingency plans for the day the vendor changes or withdraws the model. Treat the vendor relationship as an evidence-gathering exercise with a defined minimum.

The evidence you ask for

Request the model card or system card, the published evaluation results, the safety and red-team summaries, and the SOC 2 Type II report. These do not let you open the model, but together they describe how it was built, tested, and operated.

Be precise about what each artifact is worth. A SOC 2 Type II report tells you that an independent auditor found the vendor's stated controls operating over a review period. It does not tell you how those controls apply to your data or your use case, and it says nothing about model behavior. A model card tells you what the vendor tested and claims, on benchmarks the vendor chose. Both are useful as a starting point. Neither is your validation.

The questions that close the gaps

Vendor challenge is the act of pushing back on those claims. The questions that matter most are the ones the marketing material does not answer: What is the model version and the exact identifier, and how will we be notified when it changes? Is the model deterministic at a fixed temperature and seed, or will identical inputs produce different outputs? What is the data retention and training-use policy for the inputs we send? What happens to our pipeline if you deprecate this version with 30 days' notice?

Version pinning deserves particular attention. A silent model update is a new model entering production without any validation, and many vendor APIs expose a moving alias that resolves to whatever version is current. Pin to a dated version identifier and treat any version change as a trigger for re-validation. That is one piece of the contingency planning SR 11-7 wants, made concrete.
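The pinning rule can be enforced mechanically. The sketch below is a minimal illustration, not a vendor API: it assumes a hypothetical naming convention in which a pinned identifier ends in a date stamp, and the identifiers shown are invented. Real vendors name versions differently, so the pattern would need to match whatever scheme your vendor documents.

```python
import re

# ASSUMPTION: a pinned identifier ends in a date stamp, e.g.
# "vendor-model-2024-06-01" or "vendor-model-20240601". This convention
# is hypothetical; adjust the pattern to your vendor's documented scheme.
DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """Return True if the identifier carries a date stamp rather than a moving alias."""
    return DATED_SUFFIX.search(model_id) is not None


def needs_revalidation(validated_id: str, served_id: str) -> bool:
    """Flag re-validation when the served model is unpinned or differs from the validated one."""
    return (not is_pinned(served_id)) or served_id != validated_id


if __name__ == "__main__":
    validated = "vendor-model-2024-06-01"  # hypothetical identifier on the validation file
    for served in (validated, "vendor-model-2024-09-15", "vendor-model-latest"):
        print(served, "-> re-validate:", needs_revalidation(validated, served))
```

A check like this belongs wherever the model identifier is configured or logged, so that a change of version fails loudly instead of passing into production unnoticed.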

Outcomes analysis is where you actually validate

Outcomes analysis, the third pillar, carries the load when opacity constrains what a conceptual soundness review can show. It compares model output to known-good results, and it does not require the weights. You build your own evidence by testing behavior on data you control.

This is the part you can fully own. Construct a holdout set drawn from your real use case, score the model against it, and benchmark the result against alternatives. Two benchmarks matter: a challenger model, often a cheaper or simpler model or a competing vendor, and a non-model baseline, such as the existing rule-based process or human performance on the same tasks. If the LLM cannot beat the baseline you already run, the design fit question answers itself.
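The comparison against a challenger and a baseline reduces to scoring each set of predictions against the same labeled holdout. The sketch below assumes you have already collected one list of predicted labels per system. The function names, system names, and toy data are illustrative, not results from any real test.

```python
from typing import Dict, Mapping, Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of holdout cases where the prediction matches the known-good label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def benchmark(labels: Sequence[str],
              systems: Mapping[str, Sequence[str]]) -> Dict[str, float]:
    """Score every system (LLM, challenger, baseline) on the same holdout set."""
    return {name: accuracy(preds, labels) for name, preds in systems.items()}


if __name__ == "__main__":
    # Toy, invented data purely to show the shape of the comparison.
    labels = ["billing", "fraud", "billing", "service"]
    scores = benchmark(labels, {
        "vendor_llm": ["billing", "fraud", "billing", "billing"],
        "challenger": ["billing", "billing", "billing", "service"],
        "keyword_baseline": ["billing", "billing", "billing", "billing"],
    })
    for name, score in scores.items():
        print(f"{name}: {score:.2f}")
```

Raw accuracy on a small holdout carries sampling error, so a narrow margin over the baseline should not be read as a win without a larger set or an interval around the estimate.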

A worked example

A bank wants to use a vendor LLM to triage incoming complaint emails into one of eight regulatory categories, which then routes the case and starts a response clock. The validators cannot see the model, so they validate the behavior.

They assemble 500 historically routed complaints with known-correct categories, hold them out, and run them through the pinned model version twice on separate days. Three things come out of that test. Accuracy against the labeled set gives an outcomes number that can be compared to the prior keyword-routing system. Running the set twice exposes stochasticity, because any case that flips category between runs is a stability defect that a deterministic rule did not have. And slicing accuracy by complaint type surfaces where the model is weak, for instance strong on billing disputes but unreliable on fair-lending-adjacent language, which is exactly the slice where a misroute carries the most regulatory weight.
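The stability and per-slice measurements from that test can be sketched in a few lines. This assumes the two runs are stored as lists of predicted categories in the same case order as the labels. The function names and the toy data are hypothetical, chosen only to mirror the example.

```python
from collections import defaultdict
from typing import Dict, Sequence


def flip_rate(run_a: Sequence[str], run_b: Sequence[str]) -> float:
    """Share of cases whose predicted category differs between two runs."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and the same length")
    return sum(a != b for a, b in zip(run_a, run_b)) / len(run_a)


def accuracy_by_slice(predictions: Sequence[str],
                      labels: Sequence[str],
                      slices: Sequence[str]) -> Dict[str, float]:
    """Accuracy computed separately for each slice key (here, complaint type)."""
    if not (len(predictions) == len(labels) == len(slices)):
        raise ValueError("predictions, labels, and slices must be the same length")
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for pred, label, key in zip(predictions, labels, slices):
        totals[key] += 1
        hits[key] += int(pred == label)
    return {key: hits[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # Toy, invented data: four cases, two runs on separate days.
    labels = ["billing", "billing", "fair_lending", "fair_lending"]
    day_one = ["billing", "billing", "fair_lending", "fair_lending"]
    day_two = ["billing", "billing", "fair_lending", "billing"]
    print("flip rate:", flip_rate(day_one, day_two))
    # Slicing by the known-correct category, as in the worked example.
    print("day two by slice:", accuracy_by_slice(day_two, labels, labels))
```

Slicing by the known-correct category, as here, shows where the model misroutes. Slicing by the predicted category instead would show how trustworthy each routing decision is, which is a different question.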

The output is a validation file with no reference to weights or training data. It documents the use-case fit, the vendor evidence and its limits, the pinned version, the holdout accuracy against a named challenger and baseline, and the measured stability and per-slice performance. That is a defensible validation of a model nobody on the team can open.

The takeaway

Validating a closed model is not a weaker version of validating an open one. It is a different method aimed at the same standard. You move conceptual soundness from code review to use-case fit and vendor evidence, you run vendor challenge as a structured demand for artifacts and version control, and you carry the real proof in outcomes analysis on data you own. The weights stay closed. The validation does not have to.
