Feature Stores and Point-in-Time Correctness: Reproducibility as a Control · The Data Substrate for AI in Finance

Most feature store pitches lead with speed: reuse features, ship models faster, stop rewriting the same SQL. That case is real, but it is not the case that matters to a model risk team. The question an examiner asks is narrower and harder. Given a score your model produced on a specific account on a specific day, can you reconstruct the exact feature values that went into it, prove no future information leaked in, and show the same logic ran in training and in production?

If you cannot answer that, your lineage from earlier modules stops at the database. The feature layer is where reproducibility either holds or quietly breaks.

Why the feature layer is the failure point

A raw data warehouse tells you what a customer's balance is today. A model needs to know what the balance was at the moment of the decision, which may have been six months ago and several updates back. The gap between those two facts is where two common failures live.

The first is data leakage. If your training set joins a feature as it looks now rather than as it looked at decision time, the model learns from information that did not exist yet. Performance looks excellent in development and collapses in production, because the future you trained on is no longer available to predict the present.

The second is training-serving skew. Most teams compute features twice: a batch pipeline builds the training set, and a separate online path computes the same features at inference. Subtle differences between those two code paths mean the model is scored on inputs it was never trained on. The model is fine. The plumbing is not.

Reproducibility is not a nice-to-have report. It is the control that lets you stand behind a specific score on a specific day.

Point-in-time correctness, precisely

Point-in-time correctness means that for any training example, the feature store returns each feature exactly as it existed at that example's timestamp, and never later. The mechanism is an as-of join, sometimes called feature time travel. You provide an entity and a timestamp, and the store reaches back through the feature's history to pull the most recent value at or before that moment.

Concretely: you supply a spine of (account_id, decision_time, label) rows. For each row, the store joins each feature using its own event timestamp, taking the latest value whose timestamp is less than or equal to decision_time. A balance updated the day after the decision is invisible to that row. This is what prevents data leakage and what makes a training set an honest reconstruction of history rather than a flattering one.
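The as-of lookup can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any particular feature store's API: BALANCE_HISTORY, as_of, and build_training_set are hypothetical names, the data is invented, and the sketch assumes each account's history is already sorted by event timestamp. (In pandas, merge_asof with direction="backward" performs the same kind of join over whole tables.)

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history per account: (event_time, balance) pairs,
# assumed to be sorted by event_time.
BALANCE_HISTORY = {
    "A": [
        (datetime(2024, 3, 1), 100.0),
        (datetime(2024, 3, 11), 250.0),
        (datetime(2024, 3, 21), 900.0),
    ],
    "B": [(datetime(2024, 3, 16), 40.0)],
}


def as_of(history, decision_time):
    """Return the latest value with event_time <= decision_time, or None."""
    times = [event_time for event_time, _ in history]
    # bisect_right gives the number of events at or before decision_time.
    idx = bisect_right(times, decision_time)
    return history[idx - 1][1] if idx else None


def build_training_set(spine, feature_history):
    """Attach the point-in-time balance to each (account_id, decision_time, label) row."""
    rows = []
    for account_id, decision_time, label in spine:
        balance = as_of(feature_history.get(account_id, []), decision_time)
        rows.append({
            "account_id": account_id,
            "decision_time": decision_time,
            "balance": balance,
            "label": label,
        })
    return rows


spine = [
    ("A", datetime(2024, 3, 10), 0),
    ("A", datetime(2024, 3, 20), 1),
    ("B", datetime(2024, 3, 15), 0),
]
training_set = build_training_set(spine, BALANCE_HISTORY)
```

Account A's row for 10 March gets the 1 March balance; the 11 March update is invisible to it. Account B has no value at or before its decision time, so the feature comes back missing rather than borrowed from the future.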

The serving side reuses the same feature definitions, ideally the same transformation code, so the value computed online matches the value the offline store would have produced. One definition, two read paths. That is the structural fix for skew.

A worked example

A card issuer builds a fraud model. One feature is txn_count_30d, the number of transactions on the account in the prior 30 days.

The naive batch job computes, for every account in the training set, the 30-day count as of the day the job runs. An account flagged for fraud in March gets a count that already reflects the cluster of fraudulent transactions, plus the chargebacks and account freeze that followed. The model learns that frozen accounts commit fraud, which is true and useless, because the freeze is a consequence of the label, not a predictor available before it.

With a point-in-time join, each training row for that account uses decision_time set to the moment the transaction was authorized. The 30-day window looks backward from there. The post-decision transactions, the chargebacks, and the freeze are all after the timestamp, so the store excludes them. The feature now reflects only what a real-time system could have seen at authorization. The development AUC drops, which is the point. The number you ship is the number you can defend.
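The difference between the two jobs can be made concrete with a small sketch. The transaction timestamps, the job run date, and the choice to exclude the transaction being authorized from its own window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)


def txn_count_30d(txn_times, as_of_time):
    """Count transactions in the 30 days before as_of_time.

    Assumption: the window is [as_of_time - 30 days, as_of_time), so the
    transaction being authorized is not counted in its own feature.
    """
    start = as_of_time - WINDOW
    return sum(1 for t in txn_times if start <= t < as_of_time)


# Hypothetical transaction log for one account.
txn_times = [
    datetime(2024, 2, 20, 9, 0),
    datetime(2024, 3, 1, 14, 0),
    datetime(2024, 3, 5, 18, 30),
    datetime(2024, 3, 10, 12, 0),   # the authorization being scored
    datetime(2024, 3, 10, 12, 5),   # fraudulent cluster begins
    datetime(2024, 3, 10, 12, 10),
    datetime(2024, 3, 10, 12, 20),
    datetime(2024, 3, 11, 8, 0),
]

decision_time = datetime(2024, 3, 10, 12, 0)
job_run_time = datetime(2024, 3, 15, 0, 0)

# Naive batch job: the window ends when the job runs, so it sees the aftermath.
naive_count = txn_count_30d(txn_times, job_run_time)

# Point-in-time: the window ends at authorization, as it would in production.
pit_count = txn_count_30d(txn_times, decision_time)
```

Here the naive job reports 8 transactions and the point-in-time version reports 3. Only the second number is one a real-time system could have computed at authorization.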

When the same model scores a live transaction, the online store computes txn_count_30d from the identical definition. If the offline and online definitions diverge, even by a rounding rule or a timezone, the score is built on inputs the model never saw in training, and your reproducibility claim is void.
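A timezone mismatch is easy to underestimate, so here is a hedged sketch of how it changes a value. It uses a hypothetical calendar-day variant of the feature (txn_count_30d_calendar) and assumes the offline path buckets transactions by UTC date while the online path uses a fixed UTC-5 offset; neither detail comes from a real system.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
LOCAL = timezone(timedelta(hours=-5))  # hypothetical fixed-offset issuer timezone


def txn_count_30d_calendar(txn_times, as_of_time, tz):
    """Count transactions on the 30 calendar days before as_of_time's date.

    Dates are taken in tz, which is exactly where two code paths can disagree.
    """
    as_of_date = as_of_time.astimezone(tz).date()
    start_date = as_of_date - timedelta(days=30)
    return sum(
        1 for t in txn_times
        if start_date <= t.astimezone(tz).date() < as_of_date
    )


txn_times = [
    datetime(2024, 3, 1, 12, 0, tzinfo=UTC),
    datetime(2024, 3, 10, 2, 0, tzinfo=UTC),  # 9 March, 21:00 in UTC-5
]
as_of_time = datetime(2024, 3, 10, 15, 0, tzinfo=UTC)

offline_value = txn_count_30d_calendar(txn_times, as_of_time, UTC)
online_value = txn_count_30d_calendar(txn_times, as_of_time, LOCAL)
```

The same transactions and the same decision time yield 1 offline and 2 online, because the late-evening transaction falls on "today" in UTC and on "yesterday" in local time. Passing the timezone as part of one shared definition, rather than letting each path pick its own, removes the ambiguity.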

What makes it a control, not just architecture

A feature store becomes an auditable control when three things are true and recorded.

Features are versioned and timestamped

Each feature has a definition under version control and an event timestamp on every value. When a definition changes, the old version stays queryable. You can re-run a six-month-old scoring decision against the feature logic and the data state that were live that day, not the ones live now. This is the feature-layer expression of the lineage discipline from Module 2.
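One way to picture "the old version stays queryable" is a registry that resolves a feature definition by date. The sketch below is illustrative only: FeatureVersion, FeatureRegistry, the avg_txn_amount feature, and its two definitions are hypothetical, and a production store would also pin the underlying data state, not just the logic.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class FeatureVersion:
    name: str
    version: int
    effective_from: datetime
    compute: Callable[[list], float]


class FeatureRegistry:
    """Keeps every version of a feature definition so old ones stay queryable."""

    def __init__(self):
        self._versions = {}

    def register(self, feature_version):
        versions = self._versions.setdefault(feature_version.name, [])
        versions.append(feature_version)
        versions.sort(key=lambda v: v.effective_from)

    def version_as_of(self, name, when):
        """Return the definition that was live at the given time."""
        live = [v for v in self._versions.get(name, []) if v.effective_from <= when]
        if not live:
            raise LookupError(f"no version of {name!r} was live at {when}")
        return live[-1]


registry = FeatureRegistry()
# v1: average over all amounts, reversals included (hypothetical definition).
registry.register(FeatureVersion("avg_txn_amount", 1, datetime(2024, 1, 1), mean))
# v2: reversals (negative amounts) excluded (hypothetical definition).
registry.register(FeatureVersion(
    "avg_txn_amount", 2, datetime(2024, 6, 1),
    lambda amounts: mean(a for a in amounts if a > 0),
))

amounts = [10.0, 20.0, -5.0]
then = registry.version_as_of("avg_txn_amount", datetime(2024, 3, 10))
now = registry.version_as_of("avg_txn_amount", datetime(2024, 9, 1))
value_then = then.compute(amounts)  # logic that was live in March
value_now = now.compute(amounts)    # logic that is live in September
```

Re-running a March decision resolves to version 1 even though version 2 is live today, which is the behavior the audit question depends on.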

Training and serving share one definition

Skew is closed by construction, not by reconciliation. If the online and offline values are computed from separate code, you are testing for skew forever. If they derive from one definition, you remove the failure mode rather than monitoring it. Where on-demand computation is unavoidable, log both values and alert on divergence.
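Where both values do get logged, the divergence check itself can be small. The following is a sketch under assumptions: find_skew, the key layout (account, decision time, feature name), and the tolerance are hypothetical, and a real deployment would route the result to whatever alerting the team already runs.

```python
import math


def find_skew(online_log, offline_values, tolerance=1e-9):
    """Return (key, online, offline) for every logged value that does not match.

    A key missing from the offline side is reported too, since an
    unreproducible value is as much a finding as a mismatched one.
    """
    alerts = []
    for key, online_value in online_log.items():
        offline_value = offline_values.get(key)
        if offline_value is None or not math.isclose(
            online_value, offline_value, rel_tol=0.0, abs_tol=tolerance
        ):
            alerts.append((key, online_value, offline_value))
    return alerts


# Hypothetical values logged at inference time.
online_log = {
    ("A", "2024-03-10T12:00:00", "txn_count_30d"): 3.0,
    ("B", "2024-03-12T09:00:00", "txn_count_30d"): 4.0,
    ("C", "2024-03-13T17:45:00", "txn_count_30d"): 1.0,
}
# Hypothetical values recomputed from the offline store for the same keys.
offline_values = {
    ("A", "2024-03-10T12:00:00", "txn_count_30d"): 3.0,
    ("B", "2024-03-12T09:00:00", "txn_count_30d"): 5.0,
}

alerts = find_skew(online_log, offline_values)
```

With this data the check flags account B (4.0 online against 5.0 offline) and account C (no offline value to compare against).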

The reconstruction is reproducible end to end

SR 11-7, the Federal Reserve and OCC supervisory guidance on model risk management, expects documentation detailed enough that parties unfamiliar with a model can understand how it operates, its limitations, and its key assumptions. For machine learning models, meeting that standard in practice means being able to reproduce results, which reaches past the algorithm to the inputs: the data snapshot, the feature versions, the transformation code, and the timestamps. A feature store that pins all four turns "reproduce this score" from a research project into a query.
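Pinning the four inputs can be as simple as storing a small record next to each score. The sketch below is one possible shape, not a prescribed format: ScoreManifest, its field names, and the example identifiers are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ScoreManifest:
    """Everything needed to rebuild the inputs behind one score."""
    account_id: str
    decision_time: str         # ISO 8601 timestamp of the decision
    data_snapshot_id: str      # which state of the data was read
    feature_versions: dict     # feature name -> definition version
    transform_code_sha: str    # commit of the transformation code


def fingerprint(manifest):
    """Stable SHA-256 over the manifest, so any change to a pinned input is detectable."""
    payload = json.dumps(asdict(manifest), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


manifest = ScoreManifest(
    account_id="A",
    decision_time="2024-03-10T12:00:00+00:00",
    data_snapshot_id="snapshot-2024-03-10",    # hypothetical identifier
    feature_versions={"txn_count_30d": 1},
    transform_code_sha="0123abc",              # hypothetical commit id
)

original = fingerprint(manifest)
# Same pinned inputs, same fingerprint.
repeated = fingerprint(replace(manifest))
# Bumping one feature version changes the fingerprint.
changed = fingerprint(replace(manifest, feature_versions={"txn_count_30d": 2}))
```

Storing the fingerprint with the score gives a cheap test: if a reconstruction cannot reproduce the same fingerprint, one of the four pinned inputs has drifted.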

Closing takeaway

Treat the feature store as the place where reproducibility is enforced, not where features are merely convenient to find. Insist on point-in-time joins for every training set, one feature definition across training and serving, and version plus timestamp on every value. When you can take a score from any past day and rebuild its exact inputs, you have a control an examiner can test. That is the bar this module asks you to clear before the data moves on to vendor controls and audit rights.
