Idempotency, Retries, and Webhook Hell · Shipping a Payments Product: PM & Eng Craft

Every payment request you send crosses a network that can fail after the money moves but before you hear about it. Your client times out, retries, and the customer gets charged twice. Every webhook your PSP sends arrives at least once, sometimes more, sometimes out of order. These two facts are the source of most quiet money-state corruption we see in production. The bug rarely surfaces during the charge. It surfaces three days later when reconciliation (module 6) finds a second capture nobody can explain.

This module is about the failure modes that live in the gap between "request sent" and "outcome confirmed." Get the patterns right here and the ledger you designed in module 4 stays correct under retries. Get them wrong and no amount of careful double-entry bookkeeping saves you.

The retry problem is a knowledge problem

A retry is dangerous only because the caller does not know whether the first attempt succeeded. A 200 with the result is unambiguous. A timeout, a dropped connection, or a 500 tells you nothing about whether the side effect happened on the server.

The naive fix, "do not retry on error," is wrong. Transient failures are normal at scale, and abandoning a charge mid-flight leaves you with an unknown state that is worse than a clean retry. The correct fix is to make retries safe, so that sending the same request twice produces one charge.

That is what idempotency keys are for. The caller generates a unique key per logical operation and attaches it to the request. The server uses that key to recognize a repeat and return the original result instead of acting again.
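A minimal sketch of the client side of that contract, assuming a hypothetical in-memory `FlakyPaymentServer` that stands in for a PSP (it is not a real API). The point it illustrates is that the key is generated once per logical operation, outside the retry loop, so every retry carries the same key.

```python
import uuid


class FlakyPaymentServer:
    """Hypothetical in-memory stand-in for a PSP; not a real API."""

    def __init__(self):
        self.charges = []              # side effects that actually happened
        self.responses = {}            # idempotency key -> stored response
        self.drop_next_response = True

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self.responses:
            # Repeat of a known operation: replay the result, do not charge again.
            return self.responses[idempotency_key]
        self.charges.append(amount_cents)
        response = {"status": "succeeded", "amount_cents": amount_cents}
        self.responses[idempotency_key] = response
        if self.drop_next_response:
            # Simulate the dangerous case: money moved, response lost.
            self.drop_next_response = False
            raise TimeoutError("response lost after the charge was made")
        return response


def charge_with_retries(server, amount_cents, max_attempts=3):
    # One key per logical operation, created once, before any attempt.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return server.charge(key, amount_cents)
        except TimeoutError:
            continue  # outcome unknown: retry with the SAME key
    raise RuntimeError("gave up; outcome still unknown")


server = FlakyPaymentServer()
result = charge_with_retries(server, 1999)
```

Generating the key inside the loop would give each retry a fresh identity, and the server would have no way to tell a retry from a new charge.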

How a real idempotency layer behaves

Stripe's implementation is a useful reference because the edge cases are documented. You pass an Idempotency-Key header. On the classic API, Stripe saves the status code and response body of the first request under that key, whether it succeeded or failed, and replays that same response for any later request carrying the same key. A 500 gets cached too, so a retried 500 returns the cached 500 rather than risking a second charge. API v2 changes this: instead of replaying a saved error, it lets you retry a failed request without side effects and returns an updated response.

Two details matter in practice. First, the layer compares the parameters of the repeat request against the original and returns an error if they differ, which catches the bug where you reuse a key but change the amount. Second, keys expire. On the classic API the retention window is 24 hours, after which reusing the key produces a brand new request. On the newer API v2, scoped keys persist for 30 days. If your retry logic can outlive that window, you have reintroduced the double-charge you were trying to prevent.
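One way to keep retry logic inside the retention window is to record when the first attempt was made and refuse to reuse the key once the window is nearly up. The sketch below assumes the 24-hour classic-API window described above; the one-hour `SAFETY_MARGIN` and the function name are illustrative choices, not provider settings.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=24)      # classic-API window from the text
SAFETY_MARGIN = timedelta(hours=1)   # assumed buffer for clock skew and queue delay


def can_retry_with_same_key(first_attempt_at, now):
    """True while a retry is still expected to hit the stored key."""
    return now - first_attempt_at < RETENTION - SAFETY_MARGIN


first = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
early = can_retry_with_same_key(first, first + timedelta(minutes=5))
late = can_retry_with_same_key(first, first + timedelta(hours=23, minutes=30))
```

When the check fails, sending the request again under a new key is exactly the double-charge risk. A safer path is to look up the outcome of the original attempt before deciding whether any new request is needed.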

Designing your own idempotency keys

When you are the server, not just the client, you own the dedup logic. The pattern is a table keyed on the idempotency key, written inside the same database transaction that performs the operation.

```sql
CREATE TABLE idempotency_keys (
    key           text PRIMARY KEY,
    request_hash  text NOT NULL,
    response_code int,
    response_body jsonb,
    locked_at     timestamptz,
    created_at    timestamptz NOT NULL DEFAULT now()
);
```

The flow has the same shape on every request. Try to insert the key. If the insert wins, you hold the lock: perform the operation, write the response back to the row, and commit. If the insert hits a primary-key conflict, the operation is already in flight or done. Return the stored response, or, if response_code is still null because another worker holds the key, block briefly and re-read.

Two rules keep this honest. Store a hash of the request body and reject a reused key whose hash does not match, the same protection Stripe gives you. And scope the key to the operation, not the session. A key per "checkout attempt" is right. A key reused across two genuinely different charges is a data-corruption bug waiting to happen.
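The insert-or-replay flow and the hash check can be sketched as follows. This is a simplified illustration using SQLite in a single process, so the table drops the Postgres-specific types and the in-flight (null response_code) case never arises; the `charges` table, the 422 status, and the function names are assumptions made for the example.

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE idempotency_keys (
        key TEXT PRIMARY KEY,
        request_hash TEXT NOT NULL,
        response_code INTEGER,
        response_body TEXT
    )""")
db.execute("CREATE TABLE charges (id INTEGER PRIMARY KEY, amount_cents INTEGER)")


def request_hash(body):
    # Canonical JSON so the same logical request always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def handle_charge(key, body):
    digest = request_hash(body)
    try:
        with db:  # one transaction: key row, charge, and response commit together
            db.execute(
                "INSERT INTO idempotency_keys (key, request_hash) VALUES (?, ?)",
                (key, digest),
            )
            db.execute(
                "INSERT INTO charges (amount_cents) VALUES (?)",
                (body["amount_cents"],),
            )
            response = {"status": "succeeded", "amount_cents": body["amount_cents"]}
            db.execute(
                "UPDATE idempotency_keys SET response_code = ?, response_body = ? "
                "WHERE key = ?",
                (200, json.dumps(response), key),
            )
        return 200, response
    except sqlite3.IntegrityError:
        # Key already exists: replay the stored result instead of acting again.
        stored_hash, code, stored_body = db.execute(
            "SELECT request_hash, response_code, response_body "
            "FROM idempotency_keys WHERE key = ?",
            (key,),
        ).fetchone()
        if stored_hash != digest:
            return 422, {"error": "idempotency key reused with different parameters"}
        return code, json.loads(stored_body)


first = handle_charge("chk_8f21a90c", {"amount_cents": 1999})
replay = handle_charge("chk_8f21a90c", {"amount_cents": 1999})
mismatch = handle_charge("chk_8f21a90c", {"amount_cents": 2999})
charge_count = db.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

Because the key row and the charge commit in the same transaction, a crash between them leaves neither behind, and the retry starts clean.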

Webhooks are the same problem wearing a different mask

Webhooks invert the direction. Now your PSP is the caller and you are the unreliable network. Stripe guarantees at-least-once delivery and retries failed deliveries with exponential backoff for up to 72 hours. The same event.id will arrive more than once, and events can arrive out of order, so a payment_intent.succeeded can land before the payment_intent.created you expected first.

Three disciplines prevent most of the damage.

Verify the signature on the raw body

Stripe signs every event with the Stripe-Signature header. Verify it before you trust anything, using the raw request bytes. If your framework parses the JSON before you reach the verification step, the re-serialized body no longer matches the signed payload and verification fails, or worse, you skip it. Capture the raw body first.
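To show why the raw bytes matter, here is a hand-rolled sketch of the check as Stripe documents it: the header carries a timestamp `t` and one or more `v1` signatures, and the expected signature is an HMAC-SHA256 over the timestamp, a period, and the raw body, keyed with the endpoint secret. In production the official library helper (`stripe.Webhook.construct_event` in Python) is the usual route; the secret, payload, and five-minute tolerance below are illustrative values.

```python
import hashlib
import hmac
import json
import time


def verify_stripe_signature(raw_body, header, secret, tolerance_s=300, now=None):
    """Return True only if a v1 signature matches the raw bytes and is recent."""
    parts = [item.strip().split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in parts if k == "t"), None)
    candidates = [v for k, v in parts if k == "v1"]
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    signed_payload = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 candidate in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)


# Hypothetical secret and payload, signed locally to exercise the check.
secret = "whsec_test_secret"
raw = b'{"id": "evt_1Pabc",  "type": "payment_intent.succeeded"}'
ts = 1700000000
sig = hmac.new(secret.encode(), f"{ts}.".encode() + raw, hashlib.sha256).hexdigest()
header = f"t={ts},v1={sig}"

ok_raw = verify_stripe_signature(raw, header, secret, now=ts + 10)
# Parsing and re-serializing changes the whitespace, so the bytes no longer match.
reserialized = json.dumps(json.loads(raw)).encode()
ok_reserialized = verify_stripe_signature(reserialized, header, secret, now=ts + 10)
ok_stale = verify_stripe_signature(raw, header, secret, now=ts + 3600)
```

The re-serialized body carries the same data but different bytes, which is enough to fail the check.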

Dedup on the event id, then act

Treat the event.id (the evt_... value) as the idempotency key for inbound events. Insert it into a table with a unique constraint inside the handler's transaction. If the insert conflicts, you have already seen this event, so acknowledge with a 200 and stop. Do not reprocess. The unique constraint, not an if check in application code, is what makes this race-safe across concurrent deliveries: two workers can both pass a "have I seen this?" read, but only one can win the insert.
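A small sketch of dedup-then-act, again using SQLite as a stand-in for the real database. The `processed_events` and `orders` tables, the `order_id` field on the event, and the `paid_count` column are hypothetical; the counter exists only to make a double-apply visible.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, paid_count INTEGER NOT NULL)")
db.execute("INSERT INTO orders VALUES ('order_1', 0)")
db.commit()


def handle_event(event):
    try:
        with db:  # dedup row and business effect commit or roll back together
            db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["id"],),
            )
            db.execute(
                "UPDATE orders SET paid_count = paid_count + 1 WHERE id = ?",
                (event["order_id"],),
            )
    except sqlite3.IntegrityError:
        pass  # duplicate delivery: already handled, nothing to do
    return 200  # acknowledge either way so the sender stops retrying


event = {"id": "evt_1Pabc", "order_id": "order_1"}
first_status = handle_event(event)
second_status = handle_event(event)
paid_count = db.execute(
    "SELECT paid_count FROM orders WHERE id = 'order_1'"
).fetchone()[0]
```

Putting the dedup insert and the effect in one transaction matters: if the effect fails, the event id rolls back too, and the redelivery gets a real second chance.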

Acknowledge fast, process for real separately

Return 200 quickly once you have durably recorded the event, then do the heavy work asynchronously. If your handler does slow ledger writes inline and times out, Stripe sees a failure and redelivers, multiplying load on an already struggling system. Record, ack, then process.
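Record, ack, then process can be sketched as an inbox table plus a separate drain step. This is an illustration under simplifying assumptions: SQLite stands in for the durable store, `drain` is called inline where a real system would run a background worker, and the `inbox` table and function names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE inbox (
        event_id TEXT PRIMARY KEY,
        payload TEXT NOT NULL,
        processed INTEGER NOT NULL DEFAULT 0
    )""")


def receive(event_id, payload):
    """Webhook handler: durably record the event, then acknowledge."""
    with db:
        # OR IGNORE makes a redelivery of the same event a no-op.
        db.execute(
            "INSERT OR IGNORE INTO inbox (event_id, payload) VALUES (?, ?)",
            (event_id, payload),
        )
    return 200


def drain(process):
    """Worker step: run the slow work outside the request path."""
    rows = db.execute(
        "SELECT event_id, payload FROM inbox WHERE processed = 0"
    ).fetchall()
    for event_id, payload in rows:
        with db:
            process(payload)
            db.execute("UPDATE inbox SET processed = 1 WHERE event_id = ?", (event_id,))
    return len(rows)


handled = []
receive("evt_1Pabc", '{"type": "payment_intent.succeeded"}')
receive("evt_1Pabc", '{"type": "payment_intent.succeeded"}')  # redelivery
first_drain = drain(handled.append)
second_drain = drain(handled.append)
```

The handler's latency now depends on one small insert, not on ledger writes, so a slow downstream system no longer triggers redeliveries.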

A worked example

A customer taps Pay. Your client sends a charge with idempotency key chk_8f21a90c. The PSP captures the funds, then the response is lost to a timeout. Your client retries with the same key. The PSP recognizes chk_8f21a90c, returns the original success, and no second capture occurs. One charge.

Moments later the payment_intent.succeeded webhook arrives, event evt_1Pabc. Your handler verifies the signature on the raw body, inserts evt_1Pabc into the events table, wins the insert, marks the order paid, and returns 200. A network blip makes Stripe redeliver evt_1Pabc 30 seconds later. This time the insert conflicts on the unique constraint, the handler returns 200 without touching the order, and the ledger stays at exactly one paid order.

Now remove any one of those guards. Drop the idempotency key and the retry double-captures. Drop the event-id dedup and the redelivery marks the order paid twice, or fires a second fulfillment. Each guard is load-bearing on its own.

The takeaway

Retries and at-least-once webhooks are not edge cases you can defer. They are the steady-state behavior of payment systems, and they will exercise every gap you leave. Make outbound requests idempotent with a key your retry logic respects inside the provider's retention window. Make inbound events idempotent with a unique constraint on the event id and a fast, signature-verified, record-then-process handler. Done well, every duplicate becomes a no-op, and the ledger stays correct no matter how many times the network makes you say the same thing twice.
