Most application tests check that a request returns the right shape. Payments tests have to check something harder: that the money is right. A handler can return 200 OK, render a clean receipt, and still have left your ledger a dollar short or double-captured a charge. The bug is not a stack trace. It is a number that does not add up three days later, after settlement, when a finance analyst opens a spreadsheet.
That gap is why "it worked in the demo" is worthless evidence in payments. We test the unhappy paths we scoped in module 2, against the ledger we built in module 4, through the retry and webhook machinery from module 5. This module is where those pieces get proven, not assumed.
Drive the sandbox through failure, not success
Every PSP ships a sandbox, and the happy-path card is the least useful thing in it. Stripe's 4242 4242 4242 4242 always succeeds; 4000 0000 0000 0002 is always declined with card_declined. The point of the sandbox is the second card, and the dozens like it.
Build a test matrix from the decline and exception codes your PSP exposes, not from the ones you hope to hit. With Stripe that means dedicated cards per failure: 4000 0000 0000 0069 for an expired card, 4000 0000 0000 0127 for an incorrect CVC, plus the cards that force a 3-D Secure challenge. Adyen reaches the same goal differently: you force a specific outcome through the cardholder name (set holderName to values like DECLINED, CARD_EXPIRED, or FRAUD) or the RequestedTestAcquirerResponseCode field, so one card number can cover many outcomes in a parameterized test.
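A matrix like this can be expressed as a table-driven test. The sketch below is a minimal illustration, not a real integration: sandbox_charge is a hypothetical stub standing in for a sandbox call, process_order is a hypothetical handler under test, and the order states "paid" and "payment_failed" are assumed names. The card numbers are the Stripe test cards quoted above.

```python
# Hypothetical stub for a PSP sandbox: maps test cards to outcome codes.
# In a real suite this would be a call to the PSP's sandbox API.
SANDBOX_OUTCOMES = {
    "4242424242424242": "succeeded",
    "4000000000000002": "card_declined",
    "4000000000000069": "expired_card",
    "4000000000000127": "incorrect_cvc",
}


def sandbox_charge(card_number):
    return SANDBOX_OUTCOMES[card_number]


def process_order(card_number):
    """Hypothetical handler under test: returns the final order state."""
    outcome = sandbox_charge(card_number)
    return "paid" if outcome == "succeeded" else "payment_failed"


# One row per failure mode: (test card, order state we expect afterwards).
TEST_MATRIX = [
    ("4242424242424242", "paid"),
    ("4000000000000002", "payment_failed"),
    ("4000000000000069", "payment_failed"),
    ("4000000000000127", "payment_failed"),
]


def run_matrix():
    """Run every row and collect mismatches instead of stopping at the first."""
    failures = []
    for card, expected_state in TEST_MATRIX:
        actual_state = process_order(card)
        if actual_state != expected_state:
            failures.append((card, expected_state, actual_state))
    return failures
```

Collecting failures rather than asserting row by row means a single run reports every broken branch, which is useful when the matrix grows to dozens of cards.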
What the matrix has to cover
A sandbox suite that only proves "good card clears" is theater. The rows that matter are the ones that bend your state machine:
- Soft declines (insufficient funds) that your retry logic is allowed to retry, versus hard declines (lost or stolen card) that it must never retry.
- 3DS challenge flows where the customer abandons mid-authentication and the payment is left pending.
- Partial captures and partial refunds, which are where ledger math quietly goes wrong.
- Authentication results beyond plain success: the EMV 3DS transStatus values include C (challenge required), N (not authenticated), A (attempted), U (unable), and R (rejected). Each lands you in a different state, and each needs a test.
Assert on the resulting ledger entries and order state, not on the PSP response alone. A decline that leaves an order stuck in processing is a real defect even though every API call succeeded.
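The retry rule and the transStatus branches above can be pinned down as data, which makes them easy to test exhaustively. This is a sketch under assumptions: the internal state names are hypothetical, the decline codes are examples of the soft and hard categories named above, and treating unknown decline codes as non-retryable is a conservative design choice rather than anything the PSP mandates.

```python
# Example decline codes for each category; extend from your PSP's list.
SOFT_DECLINES = {"insufficient_funds"}
HARD_DECLINES = {"lost_card", "stolen_card"}


def may_retry(decline_code):
    """Only known soft declines are retryable; unknown codes are not (assumption)."""
    return decline_code in SOFT_DECLINES


# EMV 3DS transStatus -> hypothetical internal payment state.
TRANS_STATUS_STATE = {
    "Y": "authenticated",
    "C": "challenge_pending",
    "N": "authentication_failed",
    "A": "authentication_attempted",
    "U": "authentication_unavailable",
    "R": "authentication_rejected",
}


def next_state(trans_status):
    """Fail loudly on a status the state machine does not model."""
    if trans_status not in TRANS_STATUS_STATE:
        raise ValueError(f"unhandled transStatus: {trans_status!r}")
    return TRANS_STATUS_STATE[trans_status]
```

A test can then loop over every key in TRANS_STATUS_STATE and assert on the order state and ledger entries it produces, so no branch is covered only by accident.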
Reconciliation proofs: the test that catches the missing dollar
The single most valuable money test is an invariant: after any sequence of operations, your internal ledger must reconcile against the PSP's record of the same events, to the cent. If it does not, you have lost the thread between what you think happened and what the network thinks happened.
Run this as a property-based or scenario test, not a one-off. Generate a sequence of operations against a test customer, then assert two things hold:
- Your double-entry ledger still balances internally (debits equal credits, no orphaned entries).
- The sum of your captured-minus-refunded amounts equals the PSP's balance transactions for that account over the same window.
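A scenario generator for these two assertions might look like the sketch below. Everything here is hypothetical scaffolding: the ledger is a list of (account, debit_cents, credit_cents) tuples, the account names are invented, and psp_txns is simulated in-process. In a real suite the PSP side would be read back from the sandbox's balance transactions rather than generated alongside the ledger.

```python
import random


def receivable_net(ledger):
    """Captured minus refunded, derived from the receivable account."""
    return sum(d - c for acct, d, c in ledger if acct == "receivable")


def run_scenario(seed, steps=20):
    """Generate a random but reproducible sequence of captures and refunds."""
    rng = random.Random(seed)
    ledger = []    # (account, debit_cents, credit_cents)
    psp_txns = []  # signed amounts in cents, as the PSP would report them
    for _ in range(steps):
        refundable = receivable_net(ledger)
        if refundable > 0 and rng.random() < 0.4:
            amount = rng.randint(1, refundable)
            ledger.append(("refund_expense", amount, 0))
            ledger.append(("receivable", 0, amount))
            psp_txns.append(-amount)
        else:
            amount = rng.randint(100, 10_000)
            ledger.append(("receivable", amount, 0))
            ledger.append(("revenue", 0, amount))
            psp_txns.append(amount)
    return ledger, psp_txns


def check_invariants(ledger, psp_txns):
    debits = sum(d for _, d, _ in ledger)
    credits = sum(c for _, _, c in ledger)
    assert debits == credits, "ledger does not balance internally"
    assert receivable_net(ledger) == sum(psp_txns), "ledger and PSP disagree"
```

Seeding the generator keeps failures reproducible: a failing seed can be pasted into a regression test. Amounts are integer cents throughout so that "to the cent" is an exact comparison.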
A worked example
Take a $100 charge, a $30 partial refund, then a webhook-driven dispute that withdraws the remaining $70. Walk it through:
- Capture: ledger shows +$100 to a receivable, the PSP shows a $100 balance transaction. Reconciled.
- Partial refund: ledger writes a -$30 entry, PSP shows a -$30 balance transaction. Net $70 each side. Still reconciled.
- Dispute: the PSP withdraws $70 plus a dispute fee. If your test does not also book the fee, your two sides now disagree by the fee amount.
That fee is the bug. It will not show up in any sandbox 200 OK. It only appears when the reconciliation assertion compares your net to the PSP's net and finds them off by, say, $15. Catching it in CI is the difference between a code review comment and a finance escalation in month two.
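The arithmetic of this example is small enough to write out. The sketch below uses integer cents and the illustrative $15 fee from the paragraph above; ledger_net and reconciliation_gap are hypothetical helper names, and the PSP side is a hard-coded list rather than a real API response.

```python
CAPTURE = 10_000      # $100.00 charge
REFUND = 3_000        # $30.00 partial refund
DISPUTE = 7_000       # remaining $70.00 withdrawn by the dispute
DISPUTE_FEE = 1_500   # illustrative $15.00 fee

# What the PSP reports: every balance movement, including the fee.
PSP_TXNS = [CAPTURE, -REFUND, -DISPUTE, -DISPUTE_FEE]


def ledger_net(book_fee):
    """Net of our own entries, with or without the dispute fee booked."""
    entries = [CAPTURE, -REFUND, -DISPUTE]
    if book_fee:
        entries.append(-DISPUTE_FEE)
    return sum(entries)


def reconciliation_gap(book_fee):
    """Our net minus the PSP's net; zero means the two sides reconcile."""
    return ledger_net(book_fee) - sum(PSP_TXNS)
```

With the fee unbooked the gap is 1500 cents, exactly the fee; with it booked the gap is zero. A reconciliation assertion of the form reconciliation_gap(...) == 0 is what turns the missing fee into a CI failure.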
Test time, because settlement does not happen now
Settlement, renewals, and dispute windows all unfold over days. You cannot wait a billing cycle for a test to finish, and you must not let real wall-clock time leak into deterministic tests.
Stripe's test clocks solve this directly: you freeze a clock at a chosen point, attach test customers and subscriptions to it, then advance it in increments and observe how billing objects change. A full trial-to-renewal-to-failed-payment lifecycle that would take 40 real days runs in minutes. There is one constraint worth knowing: from the frozen point you can advance at most two billing intervals at a time, so a monthly subscription moves up to two months per step.
Use this to prove the boundary cases that bite in production: the renewal that fails because the card expired between sign-up and rebill, the dunning sequence that should retry on schedule, the dispute that arrives near the close of its window. Where your PSP lacks a time primitive, inject a controllable clock into your own scheduler so reconciliation and retry jobs are testable without sleep.
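For the case where you inject your own clock, a minimal sketch follows. FakeClock and DunningScheduler are hypothetical names, and the 1, 3, and 7 day retry delays are assumed values chosen only to make the example concrete. The key point is that the scheduler asks the clock for the time instead of calling the system clock itself.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Test double for the system clock; time moves only when told to."""

    def __init__(self, start):
        self._now = start

    def now(self):
        return self._now

    def advance(self, delta):
        self._now += delta


class DunningScheduler:
    """Hypothetical scheduler: retries a failed renewal after fixed delays."""

    RETRY_DELAYS = [timedelta(days=1), timedelta(days=3), timedelta(days=7)]

    def __init__(self, clock, failed_at):
        self.clock = clock
        self.due = [failed_at + delay for delay in self.RETRY_DELAYS]
        self.attempts = []  # due times of the retries that have fired

    def tick(self):
        """Fire every retry whose due time has passed, in order."""
        now = self.clock.now()
        while self.due and self.due[0] <= now:
            self.attempts.append(self.due.pop(0))


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
clock = FakeClock(start)
scheduler = DunningScheduler(clock, failed_at=start)
```

A test advances the clock with clock.advance(timedelta(days=1)), calls scheduler.tick(), and asserts on scheduler.attempts, with no sleep anywhere in the run.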
Chaos: break the connection before the network does
The PSP sandbox is reliable in a way production never is. That reliability hides your worst bugs, because it never disconnects you mid-capture. So you have to introduce the failure yourself.
The scenario that matters most is the one from module 5: you send a capture, the network times out, and you do not know whether it landed. Test it by injecting a timeout or a dropped connection between your service and the PSP, then asserting that your retry replays the same idempotency key and the PSP returns the original result rather than charging twice. A capture path that double-charges under a single injected timeout is not ready to ship.
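The timeout scenario can be reproduced in-process with a stub. In this sketch FakePSP is a hypothetical stand-in that processes the capture and then drops the response, which is the ambiguous case described above; capture_with_retry is an assumed shape for the client-side retry, not a real SDK function.

```python
class FakePSP:
    """Hypothetical PSP stub that stores results by idempotency key."""

    def __init__(self):
        self.results = {}
        self.charges = []              # one entry per charge actually made
        self.drop_next_response = False

    def capture(self, idempotency_key, amount_cents):
        # The request is processed first, so the charge may land even if
        # the caller never sees the response.
        if idempotency_key not in self.results:
            self.charges.append(amount_cents)
            self.results[idempotency_key] = {
                "id": f"ch_{len(self.charges)}",
                "amount": amount_cents,
            }
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response dropped after request was processed")
        return self.results[idempotency_key]


def capture_with_retry(psp, idempotency_key, amount_cents, max_attempts=3):
    """Retry on timeout, always replaying the same idempotency key."""
    for _ in range(max_attempts):
        try:
            return psp.capture(idempotency_key, amount_cents)
        except TimeoutError:
            continue
    raise TimeoutError("capture outcome unknown after retries")
```

The assertion that matters is on psp.charges: after one injected timeout and one retry it must contain exactly one charge, and the returned result must be the original one.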
Practical injection points
You do not need a chaos platform to start. A fault-injecting proxy or stub between your code and the PSP that can drop the response after the request is received covers the highest-value case. Add scenarios for delayed and duplicate webhooks, since a webhook delivered twice must produce one ledger effect, and a webhook that arrives out of order must not regress your state. Run these in CI on the critical paths, not as a quarterly fire drill.
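The duplicate and out-of-order webhook cases can be exercised against a handler like the sketch below. The event shape (id, state, amount_cents), the state names, and the ranking in STATE_RANK are all hypothetical; a real handler would persist the seen event IDs and ledger rows in the same transaction.

```python
# Hypothetical ordering of payment states; a later state never regresses.
STATE_RANK = {"pending": 0, "captured": 1, "refunded": 2}


class WebhookHandler:
    def __init__(self):
        self.seen_event_ids = set()
        self.ledger = []       # signed amounts in cents
        self.state = "pending"

    def handle(self, event):
        """Apply an event once; return False if it was a duplicate."""
        if event["id"] in self.seen_event_ids:
            return False
        self.seen_event_ids.add(event["id"])
        # The money movement is always recorded, whatever order it arrives in.
        self.ledger.append(event["amount_cents"])
        # The state only moves forward, so a late event cannot regress it.
        if STATE_RANK[event["state"]] > STATE_RANK[self.state]:
            self.state = event["state"]
        return True


capture_event = {"id": "evt_1", "state": "captured", "amount_cents": 10_000}
refund_event = {"id": "evt_2", "state": "refunded", "amount_cents": -3_000}
```

Delivering refund_event first, then capture_event twice, should leave two ledger entries netting 7000 cents and a final state of "refunded".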
The takeaway
Treat a passing payments test suite as proof of money correctness, not request correctness. Drive the sandbox through every decline and authentication branch you can name, assert that your ledger reconciles to the PSP's net to the cent after each scenario, fast-forward through settlement and billing time with test clocks, and inject the timeouts and duplicate deliveries that production will eventually send you anyway. If a test does not check a number, it is not testing money.