Idempotency: making webhooks and background jobs safe to retry

At-least-once delivery means the same webhook can arrive more than once, so duplicates are inevitable. This post covers the idempotency-key pattern, naturally idempotent upserts, and where Inquir's startNew dedup, retries, and dead-letter queue help, as well as where you must be idempotent yourself.

A retried request is not a bug in someone else’s system. It is the normal operating mode of nearly every webhook provider and every durable job queue you will ever integrate. Stripe resends a failed payment_intent.succeeded event for up to 72 hours. GitHub does not retry on its own, but a failed delivery can be redelivered, by hand or by script, for three days. Slack marks your app as slow if you do not answer within three seconds, then tries again. Your own background queue will re-run a job whose acknowledgement got lost on the wire. If the second, third, or tenth copy of the same event double-charges a customer or writes a duplicate row, the problem is not the network. It is that your handler was never safe to retry.

This post is about making it safe. We will cover why at-least-once delivery makes a duplicate webhook inevitable, the idempotency key pattern, natural idempotency through upserts, the difference between exactly-once and at-least-once, and, honestly, the line between what Inquir Compute deduplicates for you and what you must make idempotent yourself.

Why at-least-once delivery makes duplicate webhooks inevitable

Every reliable messaging system in production gives you the same delivery guarantee: at-least-once delivery. It means a message will arrive one or more times. Never zero, but also never guaranteed to be exactly one. The reason is a simple, unavoidable fact of distributed systems: the sender cannot tell the difference between “the receiver never got my message” and “the receiver got it, processed it, and the acknowledgement was lost on the way back.” Faced with that ambiguity, a correct system re-sends. Re-sending is what keeps you from losing events; the price is that you sometimes get them twice.
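The lost-acknowledgement argument can be sketched in a few lines. This is a toy simulation, not any provider's retry logic: deliverWithRetries and dropAcks are names invented for the illustration, and the "network" is just a counter of acknowledgements that go missing.

```javascript
// Simulates a sender that retries until it sees an acknowledgement.
// `dropAcks` is how many acknowledgements are "lost on the wire" (illustrative).
function deliverWithRetries(message, receiver, dropAcks, maxTries = 5) {
  let lost = dropAcks;
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    receiver(message);   // the receiver DOES process the message
    if (lost > 0) {      // ...but its acknowledgement never reaches the sender
      lost--;
      continue;          // the sender cannot tell this from "never arrived"
    }
    return attempt;      // acknowledgement received: stop retrying
  }
  throw new Error('gave up without an acknowledgement');
}

const received = [];
const attempts = deliverWithRetries({ id: 'evt_1' }, (m) => received.push(m.id), 1);
console.log(attempts, received); // 2 [ 'evt_1', 'evt_1' ]
```

One lost acknowledgement is enough to turn one logical event into two physical deliveries, even though nothing on the receiving side went wrong.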

That is why serious providers retry or let you redeliver. Stripe retries a webhook for up to 72 hours until it receives a 2xx. GitHub does not retry automatically, but a failed delivery can be redelivered, manually or through its API, for three days. Slack expects a response inside three seconds and re-fires if it does not get one. Now stack the real world on top: a load balancer times out mid-request, a deploy rolls a pod while it is still writing, a database lock makes your handler take four seconds instead of two hundred milliseconds. Every one of those turns a single logical event into two or more physical deliveries.

The same thing happens inside your own infrastructure, not just at the provider edge. Inquir’s durable job queue is at-least-once by design. When you enqueue background work with global.durable.startNew(), the platform persists it in Postgres, retries it with backoff if you opt in, and reaps jobs that exceed their visibility timeout — which means a job whose worker stalled can be picked up and run again. There is also no guaranteed ordering: two events that arrive close together may be processed out of sequence. Duplicates and reordering are not edge cases to patch later. They are the contract. Design for them from the first line of the handler.

Exactly-once vs at-least-once: what you can and cannot buy

Engineers reach for “exactly-once” as if it were a setting to enable. It mostly is not. True exactly-once delivery — a guarantee that a message crosses the network and is handed to your code precisely one time — is impossible in the general case, for the same lost-acknowledgement reason above. Any system that advertises “exactly-once” has, underneath, an at-least-once transport plus a deduplication layer. It has not abolished duplicates; it has hidden them behind a dedup check.

That distinction matters because it tells you where the real work lives. You cannot buy exactly-once delivery, but you can engineer exactly-once effect — sometimes called effectively-once processing. The formula is:

at-least-once delivery + idempotent processing = effectively-once effect

Idempotency is the property that applying an operation once and applying it many times produce the same result. If your handler is idempotent, you no longer care how many times a message is delivered. The first delivery does the work; every subsequent delivery is a safe no-op. Retries stop being dangerous and become merely redundant — which is exactly what you want, because retries are how you avoid losing data.
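A minimal illustration of that definition, using two made-up operations on a plain object (addCredit and markPaid are hypothetical names, not part of any API discussed here):

```javascript
// Two ways to apply the same "order paid" event to an account record.
const addCredit = (acct) => ({ ...acct, balance: acct.balance + 100 }); // NOT idempotent
const markPaid = (acct) => ({ ...acct, status: 'paid' });               // idempotent

const start = { balance: 0, status: 'pending' };

const once = markPaid(start);
const twice = markPaid(markPaid(start));
console.log(once.status === twice.status); // true: the duplicate is a no-op

const creditedOnce = addCredit(start).balance;
const creditedTwice = addCredit(addCredit(start)).balance;
console.log(creditedOnce, creditedTwice); // 100 200
```

The second operation is the kind that needs a guard: applied twice, it leaves different state than applied once.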

So the goal is not to eliminate duplicates. It is to make duplicates harmless. Everything below is a technique for doing that: dedup on a stable id before you mutate, prefer operations that are naturally idempotent, and lean on the platform’s dedup and retry machinery where it actually helps — while staying clear-eyed about where it does not.

The idempotency key pattern: dedup on the provider event id before mutating

The core pattern for an idempotent webhook is: pick a stable idempotency key, record it before you perform any side effect, and skip the work if you have seen that key before. The best key is almost always one the provider already gives you: the event id. Stripe events carry evt.id; GitHub deliveries carry a delivery GUID in the X-GitHub-Delivery header. These are stable across retries of the same logical event, which is precisely the property you need.
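As a rough sketch of picking the key, the helper below pulls the id from either provider. The function name, the provider prefix on the returned key, and the assumption that header names arrive lower-cased are all illustrative choices, not something the platform prescribes.

```javascript
// Picks a stable idempotency key from an incoming webhook.
// Assumes `headers` uses lower-cased names and `body` is the already-verified raw string.
function idempotencyKey(provider, headers, body) {
  switch (provider) {
    case 'stripe': {
      const id = JSON.parse(body).id;          // Stripe event id, e.g. "evt_..."
      if (!id) throw new Error('stripe event has no id');
      return `stripe:${id}`;                   // prefix avoids cross-provider collisions
    }
    case 'github': {
      const id = headers['x-github-delivery']; // delivery GUID header
      if (!id) throw new Error('missing X-GitHub-Delivery header');
      return `github:${id}`;
    }
    default:
      throw new Error(`unknown provider: ${provider}`);
  }
}

console.log(idempotencyKey('stripe', {}, '{"id":"evt_123"}')); // stripe:evt_123
```

Failing loudly when the id is missing is deliberate: a handler that falls back to a random or empty key has silently given up on deduplication.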

The mechanism that makes this safe under concurrency is a conditional insert. Two duplicate deliveries can arrive at the same instant on two containers; a naive “SELECT then INSERT” has a race between the check and the write. A single INSERT ... ON CONFLICT DO NOTHING is atomic: exactly one of the two racers inserts the row, the other gets zero affected rows and knows it lost.

On Inquir, verify the signature first — for Stripe and GitHub the gateway can verify the raw-body HMAC for you when you set webhookMode on the route, returning 403 BAD_SIGNATURE before your code runs and applying Stripe’s timestamp replay tolerance. Then dedup, ACK fast, and hand the heavy work to the durable queue:

-- The idempotency ledger. The provider's event id is the primary key,
-- so a duplicate delivery can never insert a second row.
CREATE TABLE processed_events (
  event_id    text PRIMARY KEY,
  provider    text NOT NULL,
  received_at timestamptz NOT NULL DEFAULT now()
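);

// webhooks/stripe.mjs: verify, dedup on the event id, then hand off.
// `db` is assumed to be a node-postgres (pg) Pool, or any client whose
// query() resolves to { rowCount }; adjust this to your own setup.
import pg from 'pg';
const db = new pg.Pool({ connectionString: process.env.DATABASE_URL });

export async function handler(event) {
  // The gateway already verified the HMAC (route webhookMode: 'stripe').
  // body still arrives as a string; parse only AFTER verification.
  const evt = JSON.parse(event.body ?? '{}');
  if (!evt.id) {
    return { statusCode: 400, body: 'missing event id' }; // nothing to key on
  }

  // 1. Idempotency key = the provider's own event id. Claim it BEFORE any mutation.
  //    ON CONFLICT DO NOTHING => a duplicate delivery affects 0 rows.
  const { rowCount } = await db.query(
    `INSERT INTO processed_events (event_id, provider)
     VALUES ($1, 'stripe') ON CONFLICT (event_id) DO NOTHING`,
    [evt.id],
  );
  if (rowCount === 0) {
    return { statusCode: 200, body: 'duplicate' }; // already seen: safe no-op, still ACK
  }

  // 2. ACK inside the provider window, then run the heavy work in the durable queue.
  //    Passing evt.id as the instance id lets the platform dedup this enqueue (24h TTL).
  try {
    await global.durable.startNew('fulfill-order', evt.id, { eventId: evt.id, type: evt.type });
  } catch (err) {
    // Handoff failed: release the claim so the provider's retry is not ACKed as a duplicate.
    await db.query(`DELETE FROM processed_events WHERE event_id = $1`, [evt.id]);
    return { statusCode: 500, body: 'enqueue failed' }; // non-2xx asks the provider to retry
  }
  return { statusCode: 200, body: 'accepted' };
}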

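Two things earn their keep here. First, you claim the key before mutating, not after, so two concurrent deliveries can never both proceed and no side effect runs without a recorded key. The cost is the window between the insert and the handoff: if the process dies there, the key is recorded but no job was enqueued, and the provider’s retry will be acknowledged as a duplicate. Narrow that window by releasing the claim when the enqueue fails, or by enqueueing first (startNew already dedups on the id) and recording the key afterwards. Second, you return 200 even for a duplicate: acknowledging a duplicate is correct, because you have genuinely finished with it. Returning an error would only summon another retry.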
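One case worth sketching is a handoff that fails after the key has been claimed: if the claim stays in place, the provider's retry looks like a duplicate and the work is never enqueued. The in-memory model below is only an illustration. A Set stands in for the processed_events table, and claim, release, and receive are hypothetical helpers.

```javascript
// In-memory stand-ins: a Set plays the role of the processed_events table.
const ledger = new Set();
const claim = (id) => (ledger.has(id) ? false : (ledger.add(id), true));
const release = (id) => ledger.delete(id);

// Claims the key, attempts the handoff, and un-claims if the handoff throws,
// so the provider's next retry is not mistaken for a duplicate.
function receive(eventId, enqueue) {
  if (!claim(eventId)) return 'duplicate';
  try {
    enqueue(eventId);
    return 'accepted';
  } catch (err) {
    release(eventId);
    return 'retry-later'; // a real handler would answer non-2xx here
  }
}

let calls = 0;
const flakyEnqueue = () => {
  if (++calls === 1) throw new Error('queue unavailable'); // fails only the first time
};

const outcomes = [
  receive('evt_1', flakyEnqueue),
  receive('evt_1', flakyEnqueue),
  receive('evt_1', flakyEnqueue),
];
console.log(outcomes); // [ 'retry-later', 'accepted', 'duplicate' ]
```

This covers failures the handler can catch. A hard crash between the claim and the enqueue still leaves a claimed key behind, which is why enqueueing first and recording the key afterwards is the more conservative ordering when losing an event is unacceptable.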

Natural idempotency: upserts instead of blind inserts

The ledger above guards the entry point. But the job it enqueues runs on an at-least-once queue too, so the job body must also survive being run more than once. The cleanest way is to make the write itself naturally idempotent, so there is nothing to dedup.

A write is naturally idempotent when running it twice leaves the same state as running it once. SET status = 'fulfilled' is naturally idempotent; balance = balance + 100 is not. INSERT ... ON CONFLICT DO UPDATE — an upsert keyed on a business identity — is the workhorse here: the first run inserts, every later run converges to the same row instead of creating a second one.
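The increment can often be reshaped into something that converges. One approach, sketched here with an in-memory Map standing in for a table keyed on the event id (applyCredit and balance are hypothetical names), is to store each credit under its event id and derive the balance from the stored rows:

```javascript
// Credits are stored per event id, so applying the same event twice
// overwrites one entry instead of adding a second one.
function applyCredit(account, eventId, amount) {
  account.credits.set(eventId, amount); // upsert keyed on the event id
  return account;
}

function balance(account) {
  let total = 0;
  for (const amount of account.credits.values()) total += amount;
  return total;
}

const account = { credits: new Map() };
applyCredit(account, 'evt_1', 100);
applyCredit(account, 'evt_1', 100); // duplicate delivery: same key, same row
applyCredit(account, 'evt_2', 50);
console.log(balance(account)); // 150
```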

// jobs/fulfill-order.mjs: runs at-least-once, so it MUST be idempotent on its own.
// `db` (a pg-style client) and `email` are assumed to come from your own modules.
export async function handler(event) {
  const { eventId, type } = event.payload ?? {};

  // Natural idempotency: upsert on a business key, never a blind INSERT.
  // A retry, a visibility-timeout reap, or a provider redelivery all converge
  // to the SAME order row instead of creating a duplicate.
  await db.query(
    `INSERT INTO orders (event_id, status, fulfilled_at)
     VALUES ($1, 'fulfilled', now())
     ON CONFLICT (event_id) DO UPDATE SET status = 'fulfilled'`,
    [eventId],
  );

  // Side effects that are NOT naturally idempotent (charge a card, send an email)
  // need their own guard: check-before-act keyed on eventId, or pass the provider's
  // own idempotency key so THEIR system dedups the duplicate for you.
  // The check here is one atomic claim, not SELECT-then-INSERT, so two concurrent
  // runs cannot both pass it. Assumes sent_receipts.event_id is UNIQUE or the primary key.
  const claim = await db.query(
    `INSERT INTO sent_receipts (event_id) VALUES ($1)
     ON CONFLICT (event_id) DO NOTHING`, [eventId],
  );
  if (claim.rowCount === 1) {
    // At-most-once: a crash after the claim and before the send drops this receipt.
    await email.sendReceipt(eventId);
  }

  return { ok: true };
}

Notice the two-tier design. State you own — the order row — is made idempotent by construction with an upsert. Side effects you don’t own — charging a card, sending an email — cannot be undone, so you guard them with an explicit check-before-act on the same key, or better, pass the provider’s own idempotency key (Stripe’s Idempotency-Key header, for example) so their system collapses your duplicate. Reach for natural idempotency first; fall back to an explicit dedup ledger only for the effects that genuinely cannot be expressed as an upsert.
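For the downstream-key option, here is a sketch of what passing Stripe's Idempotency-Key header could look like. The Idempotency-Key header and the /v1/payment_intents endpoint are Stripe's; the helper name, the `charge:` key prefix, and the placeholder secret are assumptions made for the example. The function only builds the request, so nothing is sent.

```javascript
// Builds (but does not send) a Stripe PaymentIntent request whose idempotency
// key is derived from our own event id.
function buildChargeRequest(eventId, amountCents, currency, secretKey) {
  return {
    url: 'https://api.stripe.com/v1/payment_intents',
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${secretKey}`,
        'Content-Type': 'application/x-www-form-urlencoded',
        'Idempotency-Key': `charge:${eventId}`, // same event => same key
      },
      body: new URLSearchParams({ amount: String(amountCents), currency }).toString(),
    },
  };
}

const firstTry = buildChargeRequest('evt_1', 1999, 'usd', 'sk_test_placeholder');
const retry = buildChargeRequest('evt_1', 1999, 'usd', 'sk_test_placeholder');
console.log(firstTry.options.headers['Idempotency-Key']); // charge:evt_1
// fetch(firstTry.url, firstTry.options) would send it; a retry carries the identical key.
```

Because the key is a pure function of the event id, a job that runs twice sends the same key twice, and the deduplication happens on the side that owns the effect.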

Where the platform helps: startNew 24h dedup, retries, and the dead-letter queue

Inquir gives you three concrete tools, and it is worth being precise about each so you know its edges.

startNew deduplicates the enqueue within a 24-hour TTL. The second argument to global.durable.startNew(name, id, payload) is the instance id, and it doubles as an idempotency key for the enqueue itself. If you pass a stable id — the provider event id is ideal — two identical startNew calls within a 24-hour window collapse to a single job instance instead of spawning two. That turns a duplicate delivery that slipped past your ledger into a duplicate enqueue that the platform absorbs.
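To make the window concrete, the following is a toy model of an id-keyed enqueue with a TTL. It is not Inquir's implementation: makeQueue, the return values, and the injected `now` parameter are assumptions chosen so the behaviour can be exercised without waiting a day.

```javascript
// Illustrative model only: an enqueue deduplicated on its instance id for a fixed TTL.
const DAY_MS = 24 * 60 * 60 * 1000;

function makeQueue(ttlMs = DAY_MS) {
  const seen = new Map(); // instance id -> time of the enqueue that created it
  const jobs = [];
  return {
    jobs,
    startNew(name, id, payload, now = Date.now()) {
      const first = seen.get(id);
      if (first !== undefined && now - first < ttlMs) return false; // absorbed
      seen.set(id, now);
      jobs.push({ name, id, payload });
      return true;
    },
  };
}

const q = makeQueue();
const results = [
  q.startNew('fulfill-order', 'evt_1', {}, 0),          // new instance
  q.startNew('fulfill-order', 'evt_1', {}, 1000),       // inside the window: absorbed
  q.startNew('fulfill-order', 'evt_1', {}, DAY_MS + 1), // window expired: fresh instance
];
console.log(results, q.jobs.length); // [ true, false, true ] 2
```

The third call is the case to remember: the same id outside the window is treated as new work, so the job body still has to tolerate running again.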

Retries with exponential backoff are available — opt-in for plain jobs. A plain background job defaults to maxAttempts: 1, meaning a single attempt and no automatic retry unless you raise it. (The platform’s own internal resume path uses five attempts.) When you do opt in, failed attempts are retried with exponential backoff so a flaky downstream service gets breathing room instead of a hammering.

A dead-letter path catches poison messages. When a job exhausts its configured maxAttempts, the durable queue dead-letters it and records the last error, so a permanently failing job is parked for inspection and replay rather than silently lost or retried forever. Jobs that exceed their visibility timeout are reaped and made runnable again, which is another reason the same job body can execute more than once.
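The retry and dead-letter behaviour described above can be modelled roughly as follows. The delay values, the cap, and the shape of the dead-letter record are assumptions for illustration; the platform's actual schedule and fields may differ, and a real worker would sleep between attempts instead of recording the waits.

```javascript
// Exponential backoff: 1s, 2s, 4s, ... capped at capMs (illustrative numbers).
const backoffMs = (attempt, baseMs = 1000, capMs = 60000) =>
  Math.min(capMs, baseMs * 2 ** (attempt - 1));

// Runs a handler up to maxAttempts times; parks the job once attempts are exhausted.
function runJob(job, handler, maxAttempts, deadLetters) {
  const waits = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      handler(job, attempt);
      return { status: 'done', attempts: attempt, waits };
    } catch (err) {
      if (attempt === maxAttempts) {
        deadLetters.push({ job, lastError: err.message }); // kept for inspection and replay
        return { status: 'dead-lettered', attempts: attempt, waits };
      }
      waits.push(backoffMs(attempt)); // a real worker would wait this long
    }
  }
}

const dlq = [];
const flaky = (job, attempt) => { if (attempt < 3) throw new Error('downstream 503'); };
const poison = () => { throw new Error('bad payload'); };

const ok = runJob({ id: 'a' }, flaky, 5, dlq);   // succeeds on the third attempt
const bad = runJob({ id: 'b' }, poison, 3, dlq); // exhausts its attempts
console.log(ok.status, ok.waits, bad.status, dlq.length); // done [ 1000, 2000 ] dead-lettered 1
```

Note that the flaky handler ran three times before succeeding, which is exactly why anything it wrote on attempts one and two has to be safe to repeat.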

Used together, these give you safe retries as an operational default: the entry-point ledger stops most duplicates, the startNew TTL absorbs a class of the rest, retries recover from transient failures, and the dead-letter queue quarantines the genuinely broken. None of it, however, makes your handler idempotent. That part is on you.

Where you must be idempotent: handlers run at-least-once

Here is the boundary to internalize. The platform’s 24-hour dedup covers the enqueue — the act of putting a startNew job on the queue. It does not cover the body of the job. Your handler code, and every side effect it performs, gets no automatic deduplication from the platform. If the same job runs twice — because you opted into retries and attempt one crashed after writing, because the visibility timeout reaped a slow run, because a duplicate slipped through outside the 24-hour window — your handler executes its side effects again, from the top.

That is why every example above makes the handler idempotent independently, rather than trusting the queue to run it once. The upsert on orders, the check-before-act around the email — those exist precisely because the platform will, correctly and by design, sometimes run the handler more than once. A handler that assumes single execution is a latent duplicate-charge waiting for the first retry.

The rule is short: treat every handler as if it will run at least twice, because eventually it will. Key every mutation on a stable identity. Make owned state converge with an upsert. Guard un-undoable external effects with an explicit dedup or the downstream provider’s idempotency key. Do that, and a retry is boring — which is the whole point.
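A cheap way to hold yourself to that rule is a test that runs the handler twice and checks that the second run changed nothing. The helper below is a sketch under simplifying assumptions: handlers are synchronous, they mutate a plain JSON-serializable state object, and isIdempotent is a name made up for the example.

```javascript
// Runs a handler twice against the same state and reports whether the
// second run (the simulated retry) left the state unchanged.
function isIdempotent(handler, initialState, event) {
  const state = JSON.parse(JSON.stringify(initialState)); // work on a copy
  handler(state, event);
  const afterFirst = JSON.stringify(state);
  handler(state, event); // simulated retry or redelivery
  return JSON.stringify(state) === afterFirst;
}

const blindInsert = (s, e) => { s.orders.push({ id: e.id, status: 'fulfilled' }); };
const upsert = (s, e) => {
  const row = s.orders.find((o) => o.id === e.id);
  if (row) row.status = 'fulfilled';
  else s.orders.push({ id: e.id, status: 'fulfilled' });
};

const blindResult = isIdempotent(blindInsert, { orders: [] }, { id: 'evt_1' });
const upsertResult = isIdempotent(upsert, { orders: [] }, { id: 'evt_1' });
console.log(blindResult, upsertResult); // false true
```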

What the platform does NOT guarantee

Being honest about the guarantees is what keeps your data correct, so state them plainly:

  • No exactly-once delivery. The durable job queue and webhook ingress are at-least-once. A message can and will be delivered more than once. There is no mode that changes this.
  • No guaranteed ordering. Events are not FIFO. Do not write logic that assumes event B is processed after event A just because it was produced later; carry a version or timestamp and let the later state win.
  • The startNew dedup is bounded, not absolute. It deduplicates identical enqueues only within a 24-hour TTL and only keyed on the id you pass. The same id after 24 hours starts a fresh instance, and it says nothing about how many times the job body runs.
  • Handlers have no automatic idempotency. The queue dedups the enqueue, never the side effects inside the job. Idempotent processing is your responsibility, in your handler.
  • Retries are opt-in for plain jobs. A plain job defaults to a single attempt; you configure the retry count and backoff you want. Dead-lettering and visibility-timeout reaping exist, but they recover delivery, not correctness — a job that runs twice still needs an idempotent body to be safe.

Read the list as a design brief, not a disclaimer. Every item points to the same conclusion: the platform handles delivery and recovery; you handle idempotency.
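The ordering item in the list above suggests carrying a version and letting the later state win. A minimal sketch of that guard, assuming each event carries a hypothetical monotonically increasing `version` field and using a Map in place of a table:

```javascript
// Applies an update only if it is newer than what is already stored.
function applyIfNewer(store, event) {
  const current = store.get(event.key);
  if (current && current.version >= event.version) return false; // stale or duplicate
  store.set(event.key, { version: event.version, status: event.status });
  return true;
}

const store = new Map();
const applied = [
  applyIfNewer(store, { key: 'order_1', version: 2, status: 'shipped' }), // arrives first
  applyIfNewer(store, { key: 'order_1', version: 1, status: 'paid' }),    // late: ignored
  applyIfNewer(store, { key: 'order_1', version: 2, status: 'shipped' }), // duplicate: ignored
];
console.log(applied, store.get('order_1').status); // [ true, false, false ] shipped
```

The same comparison handles both problems at once: an out-of-order event and a duplicate are each "not newer", so neither can overwrite the current state.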

Takeaway: design for duplicates, not against them

You cannot buy exactly-once delivery, and you do not need it. At-least-once delivery plus an idempotent handler gives you the effect you actually want — each event applied once — while keeping the retries that stop you from losing data. Pick a stable idempotency key, ideally the provider’s event id. Claim it before you mutate with an atomic INSERT ... ON CONFLICT DO NOTHING. Make owned writes naturally idempotent with upserts, and guard un-undoable side effects with an explicit dedup or the downstream provider’s idempotency key. Let Inquir carry the operational weight — startNew’s 24-hour enqueue dedup, opt-in retries with backoff, and the dead-letter queue — while you keep the one guarantee the platform cannot give you: a handler that is safe to run twice. Build it that way once, and every retry, redelivery, and duplicate webhook becomes a non-event.