Multi-step serverless pipelines: retries, branching, and human approval
Model long, multi-step work as a serverless pipeline of small steps — each with its own timeout, step retries, and run record — with branching, fan-out/fan-in, human approval gates that pause and resume, and cron triggers.
Most backend work that matters is not one function call — it is a sequence. Validate the order, charge the card, render the invoice, email it. Somewhere in the middle there is a branch (“is this a high-value order?”) and, often, a human who has to click Approve before money moves. The reflex on any serverless platform is to stuff all of it into a single handler. That holds up right until the third step throws, or the whole thing runs past the function timeout, and you are left with a 500 in the logs and no idea which of the six things actually failed.
This post is about the other shape: modeling that work as a serverless pipeline, a graph of small steps, each with its own timeout, its own step retries, and its own run record. We will walk through how Inquir Compute’s pipeline graph handles multi-step serverless work end to end: chaining steps, branching, fan-out/fan-in parallelism, a human approval workflow that pauses and resumes, cron-triggered runs, and a full execution history. It closes with an honest section on what the platform deliberately does not promise.
From one giant function to a multi-step serverless pipeline
Here is the shape to avoid — one function doing everything:
// One function that does everything: the shape to avoid.
export async function handler(event) {
  const order = await validateOrder(event.body);  // can throw
  const charge = await chargePayment(order);      // can throw; money moves
  const invoice = await renderInvoice(charge);    // slow
  await emailInvoice(invoice);                    // flaky third-party
  return { ok: true };
}
It reads fine. It fails badly. Four problems, all structural:
- Partial failure is invisible. If emailInvoice throws, the caller sees a 500. Did the charge go through? Probably. Was the invoice rendered? Who knows. The failure and the state live in different places.
- Retrying replays everything. Re-run this handler after the email failed and you re-run validateOrder and chargePayment too. On a payment step, “retry the whole function” is exactly how you double-charge a customer.
- One timeout for the whole chain. Every function has a time budget: on Inquir Compute the default is 5 seconds, and the hard ceiling is 15 minutes. Four sequential calls share that single budget. A slow invoice render can push the whole request over the edge and kill the parts that already succeeded.
- No per-step record. One invocation, one log stream, one duration. You cannot see that step 2 took 4.9s and step 4 has failed the last three nights running.
The fix is not a bigger function — it is smaller ones, wired together. Make each step its own function and let the platform run them as a pipeline: each step gets its own timeout, its own retry policy, and its own persisted run record. When the email step is flaky, you retry the email step, not the charge.
A serverless pipeline is a graph of small steps
A pipeline on Inquir Compute is a JSON graph: a list of nodes and the edges between them. A node is one unit of work or control flow. The kinds you will actually place are:
- Triggers: manualTrigger, httpTrigger, cronTrigger. Where a run starts.
- lambda: invoke one of your deployed functions. This is where your code runs.
- if: branch on a condition.
- parallel / merge: fan a run out into branches and join them back.
- set / mapper: shape data between steps (write to vars, or map one payload into another) without a function call.
- respond: build the HTTP response for a request-triggered pipeline.
- humanGate: pause for a person.
Edges carry the flow, and each edge leaves its source on a handle: success, error, true, false, parallel, approve, reject, or the plain default. That handle is how a single node routes to different next steps depending on what happened.
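To make the handle idea concrete, here is a minimal sketch of how a runner might pick the next nodes for a given handle. It is an illustration, not Inquir Compute's implementation: the function name nextNodeIds is hypothetical, and treating an edge with no sourceHandle as the default handle is an assumption.

```javascript
// Sketch: select the edges that leave a node on a given handle.
// Assumption: an edge without sourceHandle sits on the "default" handle.
function nextNodeIds(graph, sourceNodeId, handle) {
  return graph.edges
    .filter((edge) => edge.sourceNodeId === sourceNodeId)
    .filter((edge) => (edge.sourceHandle ?? "default") === handle)
    .map((edge) => edge.targetNodeId);
}

const demoGraph = {
  edges: [
    { id: "e1", sourceNodeId: "in", targetNodeId: "valid" },
    { id: "e3", sourceNodeId: "charge", targetNodeId: "email", sourceHandle: "success" },
    { id: "e4", sourceNodeId: "charge", targetNodeId: "fail", sourceHandle: "error" },
  ],
};

console.log(nextNodeIds(demoGraph, "charge", "success")); // [ 'email' ]
console.log(nextNodeIds(demoGraph, "charge", "error"));   // [ 'fail' ]
console.log(nextNodeIds(demoGraph, "in", "default"));     // [ 'valid' ]
```

The same node yields different successors depending only on which handle fired, which is all that routing amounts to at the graph level.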
Here is the order flow from the previous section, rebuilt as a graph:
{
"schemaVersion": 1,
"nodes": [
{ "id": "in", "kind": "httpTrigger", "name": "Order received", "config": { "method": "POST" } },
{ "id": "valid", "kind": "lambda", "name": "Validate order",
"config": { "functionId": "validate-order", "onError": "failPipeline" } },
{ "id": "charge", "kind": "lambda", "name": "Charge payment",
"config": { "functionId": "charge-payment",
"retryPolicy": { "maxAttempts": 3, "backoffMs": 1000, "strategy": "exponential" },
"onError": "errorBranch" } },
{ "id": "email", "kind": "lambda", "name": "Email invoice",
"config": { "functionId": "email-invoice",
"inputMapping": { "to": "{{trigger.body.email}}", "charge": "{{steps.charge.output}}" },
"retryPolicy": { "maxAttempts": 5, "backoffMs": 500, "strategy": "exponential" },
"onError": "continue" } },
{ "id": "ok", "kind": "respond", "name": "200 OK",
"config": { "statusCode": 200, "outputMapping": { "orderId": "{{steps.charge.output.id}}" } } },
{ "id": "fail", "kind": "respond", "name": "402 Declined",
"config": { "statusCode": 402, "outputMapping": { "error": "payment_failed" } } }
],
"edges": [
{ "id": "e1", "sourceNodeId": "in", "targetNodeId": "valid" },
{ "id": "e2", "sourceNodeId": "valid", "targetNodeId": "charge", "sourceHandle": "success" },
{ "id": "e3", "sourceNodeId": "charge", "targetNodeId": "email", "sourceHandle": "success" },
{ "id": "e4", "sourceNodeId": "charge", "targetNodeId": "fail", "sourceHandle": "error" },
{ "id": "e5", "sourceNodeId": "email", "targetNodeId": "ok", "sourceHandle": "success" }
]
}
A few things worth calling out, because they are the whole point:
- Each lambda node is a real, isolated invocation: its own container, its own memory, its own timeout (5s by default, overridable per node via timeoutMs up to the 15-minute max), and its own persisted step-execution record with the input it received, the output or error it produced, the attempt number, and the duration. Lambda steps also link back to the underlying function invocation, so you can jump from the pipeline view straight to the raw call.
- Steps pass data through templates. A node reads upstream values with {{steps.<id>.output}}, the trigger with {{trigger.body}}, run-scoped variables with {{vars.<name>}}, and its direct input with {{input}}. Above, email builds its own payload with inputMapping instead of just receiving the previous node’s output.
- Handles do the routing. The charge node sends its success handle to the email step and its error handle to a 402 response. One node, two futures.
The graph is configuration; your logic still lives in ordinary functions. The pipeline’s job is to wire them, route between them, retry them, and remember what happened.
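The template syntax can be illustrated with a small resolver. This is a sketch under assumptions, not the platform's resolver: the helper names (lookup, resolveTemplate, resolveMapping) are hypothetical, the path grammar is limited to word characters and dots, and preserving the value's type when a template is exactly one placeholder is a guess about sensible behavior.

```javascript
// Sketch of {{path}} resolution against a run context (not the real resolver).
function lookup(context, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), context);
}

function resolveTemplate(template, context) {
  const whole = /^\{\{\s*([\w.]+)\s*\}\}$/.exec(template);
  // Assumption: a template that is exactly one placeholder keeps the value's type.
  if (whole) return lookup(context, whole[1]);
  // Otherwise each placeholder is interpolated into the surrounding string.
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (_, path) =>
    String(lookup(context, path) ?? "")
  );
}

function resolveMapping(mapping, context) {
  return Object.fromEntries(
    Object.entries(mapping).map(([key, tpl]) => [key, resolveTemplate(tpl, context)])
  );
}

const context = {
  trigger: { body: { email: "a@example.com" } },
  steps: { charge: { output: { id: "ch_1", amount: 42 } } },
};

const payload = resolveMapping(
  { to: "{{trigger.body.email}}", charge: "{{steps.charge.output}}" },
  context
);
console.log(payload.to);            // a@example.com
console.log(payload.charge.amount); // 42
```

Here the mapping mirrors the email node's inputMapping: the step receives a payload built from the trigger and an earlier step rather than whatever the previous node happened to emit.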
Step retries and error handling without replaying the whole job
Every lambda node takes an optional retry policy:
"retryPolicy": { "maxAttempts": 3, "backoffMs": 1000, "strategy": "exponential" }
strategy is fixed (wait backoffMs between every attempt) or exponential (backoffMs, then 2×, 4×, …). The important default: there is no retry unless you ask for one — an unconfigured step runs exactly one attempt. Retries are a per-step opt-in, not a platform-wide “everything runs five times.” That is deliberate: you want aggressive retries on a flaky email API and zero retries on a non-idempotent charge.
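The two strategies can be written out as a delay schedule. The sketch below assumes that maxAttempts counts the first attempt and that there is no jitter or delay cap; neither is stated in the text, and backoffDelays is a hypothetical helper name.

```javascript
// Sketch: delays (in ms) waited before attempts 2..maxAttempts.
// Assumptions: maxAttempts includes the first attempt; no jitter, no cap.
function backoffDelays({ maxAttempts, backoffMs, strategy }) {
  const delays = [];
  for (let retry = 0; retry < maxAttempts - 1; retry += 1) {
    delays.push(strategy === "exponential" ? backoffMs * 2 ** retry : backoffMs);
  }
  return delays;
}

console.log(backoffDelays({ maxAttempts: 3, backoffMs: 1000, strategy: "exponential" })); // [ 1000, 2000 ]
console.log(backoffDelays({ maxAttempts: 3, backoffMs: 1000, strategy: "fixed" }));       // [ 1000, 1000 ]
console.log(backoffDelays({ maxAttempts: 1, backoffMs: 1000, strategy: "fixed" }));       // []
```

The last call is the unconfigured case: one attempt, no waits.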
Because retries are per step, a failure re-runs only the failed step. The charge that already succeeded stays succeeded; only the email step tries again, with backoff, up to its own maxAttempts. Each attempt is written as its own record, so “attempt 2 of 3, failed after 5s” is something you can actually see.
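A per-step retry loop with one record per attempt might look roughly like this. It is a sketch, not the platform's executor: runWithRetries, the record fields, and the injected invoke and sleep functions are all hypothetical, chosen so the example runs on its own.

```javascript
// Sketch of a per-step retry loop that records every attempt.
// invoke and sleep are injected so the example runs without a platform.
async function runWithRetries(policy, invoke, sleep) {
  const records = [];
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt += 1) {
    const startedAt = Date.now();
    try {
      const output = await invoke(attempt);
      records.push({ attempt, status: "SUCCEEDED", durationMs: Date.now() - startedAt });
      return { ok: true, output, records };
    } catch (err) {
      records.push({ attempt, status: "FAILED", error: err.message, durationMs: Date.now() - startedAt });
      if (attempt < policy.maxAttempts) {
        const factor = policy.strategy === "exponential" ? 2 ** (attempt - 1) : 1;
        await sleep(policy.backoffMs * factor);
      }
    }
  }
  return { ok: false, records };
}

// Demo: an email step that fails twice, then succeeds on attempt 3.
const waits = [];
const emailRun = runWithRetries(
  { maxAttempts: 5, backoffMs: 500, strategy: "exponential" },
  async (attempt) => {
    if (attempt < 3) throw new Error("smtp_unavailable");
    return { sent: true };
  },
  async (ms) => { waits.push(ms); } // record the wait instead of sleeping
);

emailRun.then((result) => console.log(result.records.length, waits)); // 3 [ 500, 1000 ]
```

Only the email invocation is repeated; nothing upstream is touched, which is the property the paragraph above describes.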
When a step exhausts its attempts, onError decides what the pipeline does:
- failPipeline (the default): stop the run and mark it FAILED.
- continue: keep going; downstream steps receive { error, __pipelineFromFailedStep } so they can react. Good for best-effort work (“post to the analytics API, but never fail the order over it”).
- errorBranch: route down the node’s error handle so you can build an explicit compensating path, like the 402 branch above.
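The three outcomes can be summarized as a small dispatch function. This is a sketch with several assumptions flagged in the comments: afterExhaustion is a hypothetical name, the handle followed under continue is assumed to be success, and the value stored in __pipelineFromFailedStep is a guess, since the text gives only the key.

```javascript
// Sketch: what a runner might do once a step has exhausted its attempts.
function afterExhaustion(config, error) {
  const onError = config.onError ?? "failPipeline"; // default per the text above
  switch (onError) {
    case "continue":
      // Assumptions: flow follows the success handle; the flag value is a guess.
      return { runFails: false, handle: "success", payload: { error, __pipelineFromFailedStep: true } };
    case "errorBranch":
      return { runFails: false, handle: "error", payload: { error } };
    case "failPipeline":
      return { runFails: true, handle: null, payload: { error } };
    default:
      throw new Error(`unknown onError: ${onError}`);
  }
}

console.log(afterExhaustion({ onError: "errorBranch" }, "card_declined").handle); // error
console.log(afterExhaustion({}, "timeout").runFails);                             // true
```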
One rule follows directly from all this: make your step handlers idempotent. A step can run more than once — retries, a manual rerun, an errorBranch that loops back to a fix. The platform does not promise exactly-once execution, so a charge step should carry an idempotency key and an “insert” should be an upsert. Design for “this might run twice” and retries become safe instead of scary.
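As an illustration of "this might run twice", here is a minimal idempotent charge step. Everything in it is hypothetical: the key format, the chargeOnce name, and the in-memory Map, which stands in for a durable store (a database row or the payment provider's own idempotency-key support).

```javascript
// Sketch: an idempotent charge step keyed by order id.
// The Map stands in for a durable store; it is not durable itself.
const processedCharges = new Map();
let providerCalls = 0;

function chargeOnce(order) {
  const key = `charge:${order.orderId}`; // hypothetical idempotency key
  if (processedCharges.has(key)) {
    return processedCharges.get(key); // replay: return the earlier result
  }
  providerCalls += 1; // a real step would call the payment provider here
  const charge = { id: `ch_${providerCalls}`, orderId: order.orderId, amount: order.amount };
  processedCharges.set(key, charge);
  return charge;
}

const firstCharge = chargeOnce({ orderId: "o-1", amount: 1200 });
const replayedCharge = chargeOnce({ orderId: "o-1", amount: 1200 });
console.log(firstCharge.id === replayedCharge.id, providerCalls); // true 1
```

A retry or manual rerun of this step returns the original charge instead of creating a second one.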
Branching and fan-out/fan-in in a serverless workflow
Straight-line pipelines are the minority. Real serverless workflow logic forks.
The if node evaluates a small sandboxed JavaScript expression against the upstream payload and takes the true or false handle:
{
"nodes": [
{ "id": "route", "kind": "if", "name": "High value?",
"config": { "expression": "input.amount > 1000" } },
{ "id": "review", "kind": "humanGate", "name": "Manual review", "config": { "mode": "approve" } },
{ "id": "auto", "kind": "lambda", "name": "Auto-approve", "config": { "functionId": "auto-approve" } }
],
"edges": [
{ "id": "e1", "sourceNodeId": "route", "targetNodeId": "review", "sourceHandle": "true" },
{ "id": "e2", "sourceNodeId": "route", "targetNodeId": "auto", "sourceHandle": "false" }
]
}
The expression sees input, trigger, vars, steps, and merge inputs; it is wrapped in Boolean(...) and runs under a tight time budget with require/eval/import blocked. Here a high-value order routes to a human review gate and everything else auto-approves: branching and approval in three nodes.
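The Boolean(...) wrapping and handle selection can be sketched in a few lines. This is deliberately not a sandbox: new Function gives the expression full access to globals, and the time budget and the require/eval/import blocking described above are not reproduced. The names evaluateIf and handleFor are hypothetical, and merge inputs are left out for brevity.

```javascript
// Sketch only: NOT sandboxed. new Function can reach globals, and there is
// no time budget here. It shows the Boolean(...) wrapping and handle choice.
function evaluateIf(expression, scope) {
  const names = ["input", "trigger", "vars", "steps"];
  const fn = new Function(...names, `return Boolean(${expression});`);
  return fn(...names.map((name) => scope[name]));
}

function handleFor(expression, scope) {
  return evaluateIf(expression, scope) ? "true" : "false";
}

console.log(handleFor("input.amount > 1000", { input: { amount: 1500 } })); // true
console.log(handleFor("input.amount > 1000", { input: { amount: 200 } }));  // false
console.log(handleFor("vars.note", { vars: { note: "" } }));                // false
```

The last call shows why the wrapping matters: any expression result, including an empty string, collapses to one of exactly two handles.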
For work that splits rather than chooses, use a fan-out/fan-in structure: a parallel node fans the run out across several branches, and a merge node joins them:
{
"nodes": [
{ "id": "fan", "kind": "parallel", "name": "Fan out", "config": {} },
{ "id": "geo", "kind": "lambda", "name": "Geo-risk score", "config": { "functionId": "geo-risk" } },
{ "id": "fraud", "kind": "lambda", "name": "Fraud score", "config": { "functionId": "fraud-score" } },
{ "id": "join", "kind": "merge", "name": "Combine scores",
"config": { "mode": "all",
"outputMapping": { "geo": "{{inputs.geo}}", "fraud": "{{inputs.fraud}}" } } }
],
"edges": [
{ "id": "e1", "sourceNodeId": "fan", "targetNodeId": "geo", "sourceHandle": "parallel" },
{ "id": "e2", "sourceNodeId": "fan", "targetNodeId": "fraud", "sourceHandle": "parallel" },
{ "id": "e3", "sourceNodeId": "geo", "targetNodeId": "join", "sourceHandle": "success" },
{ "id": "e4", "sourceNodeId": "fraud", "targetNodeId": "join", "sourceHandle": "success" }
]
}
The merge node runs in mode: "all" — it waits until every incoming branch has arrived before it produces anything, then resolves its outputMapping, where {{inputs.<sourceNodeId>}} is each branch’s payload keyed by the node that produced it. The value of this pattern is structural: each branch is an independent sub-path with its own steps, its own retries, and its own records, and the merge is the single, explicit point where they come back together. Score geo-risk and fraud independently, then combine both into one decision object for the next step.
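The waiting behavior of mode "all" amounts to a join buffer. The sketch below is an assumption about how such a buffer could work, not the platform's code; createJoin and arrive are hypothetical names.

```javascript
// Sketch of a mode "all" join buffer: nothing is produced until every
// expected branch has delivered its payload.
function createJoin(expectedSources) {
  const inputs = {};
  return {
    arrive(sourceNodeId, payload) {
      inputs[sourceNodeId] = payload;
      const ready = expectedSources.every((id) => id in inputs);
      // Payloads are keyed by the node that produced them, like {{inputs.<id>}}.
      return ready ? { ...inputs } : null;
    },
  };
}

const join = createJoin(["geo", "fraud"]);
const afterGeo = join.arrive("geo", { score: 0.2 });
const afterFraud = join.arrive("fraud", { score: 0.7 });

console.log(afterGeo);   // null
console.log(afterFraud); // { geo: { score: 0.2 }, fraud: { score: 0.7 } }
```

The order in which branches finish does not matter; the join fires once, on whichever arrival completes the set.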
Human approval workflow: gates that pause and resume
This is the feature people underestimate. A human approval workflow on most stacks means a queue, a webhook, a database flag, and a second entry-point function to resume — plumbing you write and maintain. Here it is one node.
A humanGate has two modes. approve gives you two outgoing handles, approve and reject. question asks a person something and feeds their submitted answer back into the run through a single default edge.
{
"nodes": [
{ "id": "gate", "kind": "humanGate", "name": "Approve refund",
"config": { "mode": "approve",
"promptTemplate": "Refund {{input.amount}} to {{trigger.body.email}}?" } },
{ "id": "pay", "kind": "lambda", "name": "Issue refund", "config": { "functionId": "issue-refund" } },
{ "id": "deny", "kind": "lambda", "name": "Notify denied", "config": { "functionId": "notify-denied" } }
],
"edges": [
{ "id": "e1", "sourceNodeId": "gate", "targetNodeId": "pay", "sourceHandle": "approve" },
{ "id": "e2", "sourceNodeId": "gate", "targetNodeId": "deny", "sourceHandle": "reject" }
]
}
When a run reaches the gate, it suspends. The step is recorded as WAITING, and a checkpoint — a serializable snapshot of the run — is written to the execution: the trigger payload, run variables, the output of every step completed so far, any in-flight merge join buffers, the response built up to this point, and the resume pointer (which gate, which step, which mode). Nothing is holding a process open; the run is durable at rest and can wait as long as it needs to.
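To show what "serializable snapshot" means in practice, here is a sketch of a checkpoint as a plain object that survives a JSON round trip. The field names are hypothetical; the text lists what the checkpoint contains, not its schema.

```javascript
// Sketch of a checkpoint as a plain, JSON-serializable object.
// Field names are hypothetical; only the contents come from the description.
function buildCheckpoint(run, gate) {
  return {
    trigger: run.trigger,
    vars: run.vars,
    stepOutputs: run.stepOutputs,
    mergeBuffers: run.mergeBuffers,
    response: run.response,
    resume: { gateNodeId: gate.nodeId, stepId: gate.stepId, mode: gate.mode },
  };
}

const runState = {
  trigger: { body: { email: "a@example.com" } },
  vars: { currency: "USD" },
  stepOutputs: { lookup: { amount: 80 } },
  mergeBuffers: {},
  response: null,
};

// Persist, then restore later with no process held open in between.
const stored = JSON.stringify(
  buildCheckpoint(runState, { nodeId: "gate", stepId: "step-7", mode: "approve" })
);
const restored = JSON.parse(stored);

console.log(restored.resume.mode, restored.stepOutputs.lookup.amount); // approve 80
```

Because the snapshot is just data, the wait between suspend and resume costs storage, not compute.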
There is one detail here that shows the state is genuinely checkpointed rather than hand-waved. If the gate sits inside a fan-out, the sibling branches that had not run yet when the gate suspended are saved with the checkpoint as pending edges. When someone acts on the gate, the pipeline resumes from the snapshot, continues down the matching handle (approve → issue the refund, reject → notify), and runs those pending siblings — so a gate in one branch never silently abandons the others, and a downstream merge that was waiting on them can still complete.
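The resume step can be sketched as "follow the chosen handle, then pick up the saved siblings". This is an illustration for approve mode only; resumeTargets, the pendingEdges field name, the audit node, and the ordering of the result are assumptions.

```javascript
// Sketch: which nodes run when a gate is resolved (approve mode only).
// pendingEdges holds sibling edges that had not run when the gate suspended.
function resumeTargets(checkpoint, edges, decision) {
  const chosen = edges
    .filter((edge) => edge.sourceNodeId === checkpoint.resume.gateNodeId)
    .filter((edge) => edge.sourceHandle === decision)
    .map((edge) => edge.targetNodeId);
  const siblings = checkpoint.pendingEdges.map((edge) => edge.targetNodeId);
  return [...chosen, ...siblings]; // ordering is an assumption
}

const gateEdges = [
  { id: "e1", sourceNodeId: "gate", targetNodeId: "pay", sourceHandle: "approve" },
  { id: "e2", sourceNodeId: "gate", targetNodeId: "deny", sourceHandle: "reject" },
];

const savedCheckpoint = {
  resume: { gateNodeId: "gate", mode: "approve" },
  // Hypothetical sibling branch from an enclosing fan-out.
  pendingEdges: [{ id: "e9", sourceNodeId: "fan", targetNodeId: "audit", sourceHandle: "parallel" }],
};

console.log(resumeTargets(savedCheckpoint, gateEdges, "approve")); // [ 'pay', 'audit' ]
console.log(resumeTargets(savedCheckpoint, gateEdges, "reject"));  // [ 'deny', 'audit' ]
```

Either decision still schedules the audit branch, which is the "never silently abandons the others" guarantee in miniature.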
The gate’s own output merges the upstream payload with humanGate metadata — mode, the resolved prompt, and the decision (approve/reject) or the submitted answer. Downstream steps read it like any other step output, e.g. {{steps.gate.output.humanGate.decision}}. The key mental model: approval here is a node in the graph, not a blocking call in your code. You do not write a function that waits on an event. You drop a gate on the canvas and point its handles at the next steps.
Cron triggers and a full execution history per run
A pipeline does not need a caller. A cronTrigger node runs it on a schedule:
{
"schemaVersion": 1,
"nodes": [
{ "id": "cron", "kind": "cronTrigger", "name": "Nightly 02:00",
"config": { "cron": "0 2 * * *", "timezone": "UTC" } },
{ "id": "extract", "kind": "lambda", "name": "Extract",
"config": { "functionId": "nightly-extract",
"retryPolicy": { "maxAttempts": 3, "backoffMs": 2000, "strategy": "exponential" } } },
{ "id": "transform", "kind": "lambda", "name": "Transform", "config": { "functionId": "nightly-transform" } },
{ "id": "publish", "kind": "lambda", "name": "Publish", "config": { "functionId": "nightly-publish" } }
],
"edges": [
{ "id": "e1", "sourceNodeId": "cron", "targetNodeId": "extract" },
{ "id": "e2", "sourceNodeId": "extract", "targetNodeId": "transform", "sourceHandle": "success" },
{ "id": "e3", "sourceNodeId": "transform", "targetNodeId": "publish", "sourceHandle": "success" }
]
}
The node takes a standard cron expression and an optional timezone, and it is validated when you save the pipeline. Two honest caveats: scheduling is poll-based, so treat it as minute-level at the finest and expect a fire to land a little after the wall-clock minute — this is not a second-level, real-time scheduler. And a single scheduler instance is elected (via a Postgres advisory lock) so a scheduled run fires once, not once per replica.
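Minute-level matching under a poll-based scheduler can be illustrated with a deliberately tiny matcher. It is a sketch, not the platform's scheduler: cronMatches is a hypothetical helper that supports only "*" and plain integers in the five fields, works in UTC only, and ignores ranges, lists, steps, and the day-of-month/day-of-week OR rule of full cron.

```javascript
// Sketch: minute-level cron matching in UTC.
// Supports only "*" and plain integers in each of the five fields.
function cronMatches(expression, date) {
  const [minute, hour, dayOfMonth, month, dayOfWeek] = expression.trim().split(/\s+/);
  const fieldMatches = (field, value) => field === "*" || Number(field) === value;
  return (
    fieldMatches(minute, date.getUTCMinutes()) &&
    fieldMatches(hour, date.getUTCHours()) &&
    fieldMatches(dayOfMonth, date.getUTCDate()) &&
    fieldMatches(month, date.getUTCMonth() + 1) &&
    fieldMatches(dayOfWeek, date.getUTCDay())
  );
}

// A poll that wakes up seven seconds late still lands inside the 02:00 minute.
console.log(cronMatches("0 2 * * *", new Date("2024-01-15T02:00:07Z"))); // true
console.log(cronMatches("0 2 * * *", new Date("2024-01-15T02:01:00Z"))); // false
```

Seconds never enter the comparison, which is why a poll-based scheduler can promise the minute but not the second.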
Whatever triggers a run — HTTP, cron, or manual — you get the same thing back: a full execution history. Each run has an overall status (PENDING, RUNNING, WAITING, SUCCEEDED, FAILED, TIMED_OUT, or CANCELLED), a duration, and a step tree. Every node in that tree carries its input, its output or its error, its attempt count, and its duration; lambda steps link to the invocation underneath. You can cancel a RUNNING pipeline, and cancellation is checked between steps and between retry attempts — so a run sitting in a multi-minute backoff reacts promptly instead of finishing the sleep first. This is the difference between “a 500 somewhere last night” and “the transform step, attempt 2, timed out after 5s on this exact input.”
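Given records like these, pinpointing a failure is a lookup rather than log archaeology. The sketch below assumes a flattened list of step records with hypothetical field names (nodeId, status, attempt, durationMs) and assumes steps use the same status names as runs; the real history is described as a tree.

```javascript
// Sketch: find the failing step in a run's (flattened) step records.
// Field names and step-level status values are assumptions.
function describeFailure(run) {
  const failed = run.steps.find(
    (step) => step.status === "FAILED" || step.status === "TIMED_OUT"
  );
  if (!failed) return `run ${run.status}: no failed step`;
  return `${failed.nodeId}, attempt ${failed.attempt}, ${failed.status} after ${failed.durationMs}ms`;
}

const nightlyRun = {
  status: "FAILED",
  steps: [
    { nodeId: "extract", status: "SUCCEEDED", attempt: 1, durationMs: 812 },
    { nodeId: "transform", status: "TIMED_OUT", attempt: 2, durationMs: 5000 },
  ],
};

console.log(describeFailure(nightlyRun)); // transform, attempt 2, TIMED_OUT after 5000ms
```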
Limits, trade-offs, and what a pipeline is not
Accuracy matters more than a sales pitch, so here is the honest boundary:
- Steps are time-boxed. Chain, don’t stretch. Each step runs under the same function limits: 5s default, 15 minutes maximum. Long work is many steps, each under the ceiling — not one unbounded function. If a job needs an hour, it needs to be chunked into resumable steps.
- No exactly-once, no guaranteed ordering. A step can run more than once (retries, reruns, error loops), and the platform makes no exactly-once or strict-ordering promise. Idempotent handlers are not optional.
- Retries are opt-in. The default is a single attempt. Add a retryPolicy to the steps that should retry; leave it off where a retry would be dangerous.
- Cron is poll-based and minute-level. Fine for nightly, hourly, every-few-minutes jobs. Not for sub-minute or to-the-second scheduling, and fires can be slightly late.
- Fan-out is structure, not a speed guarantee. parallel/merge give you independent, separately-retried branches joined at one point. Treat it as isolation and clarity, not a promise of an N-way wall-clock speedup.
- Cold starts are real. Warm pools cut them, but the first invoke after idle is cold, and a pipeline is many invocations, so some steps will pay it. Keep steps lean.
- Versions are capped and pinned. A pipeline keeps up to 50 published versions, and a run executes against the version it started on.
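For the first limit above, "chunked into resumable steps" can be pictured as a step that handles one bounded batch and returns a cursor. This is a sketch with hypothetical names (processChunk, cursor, batchSize); the while loop merely stands in for whatever mechanism re-invokes the step, which the text does not specify.

```javascript
// Sketch: one bounded chunk of a long job per step invocation.
// The returned state carries a cursor so the next invocation can resume.
function processChunk(state, batchSize) {
  const end = Math.min(state.cursor + batchSize, state.items.length);
  const total = state.items
    .slice(state.cursor, end)
    .reduce((sum, n) => sum + n, state.total);
  return { ...state, cursor: end, total, done: end >= state.items.length };
}

// The loop stands in for repeated step invocations; it is not platform code.
let jobState = { items: [1, 2, 3, 4, 5, 6, 7], cursor: 0, total: 0, done: false };
let invocations = 0;
while (!jobState.done) {
  jobState = processChunk(jobState, 3);
  invocations += 1;
}

console.log(invocations, jobState.total); // 3 28
```

Because each call is a pure function of the incoming state, re-running a chunk from the same cursor produces the same result, which keeps the step safe to retry.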
None of these are dealbreakers; they are the shape of the tool. Respect them and pipelines become boringly reliable, which is exactly what you want from infrastructure.
Takeaway
If a unit of work has more than one failure mode, more than one step, or a human in the loop, it should not be one function. Model it as a serverless pipeline: small lambda steps, each with its own timeout, its own step retries, and its own run record; if for branching; parallel + merge for fan-out/fan-in work; a humanGate that suspends, checkpoints, and resumes for a real human approval workflow; and a cronTrigger for scheduled runs. Keep every step small, idempotent, and inside the timeout, and let the graph handle the wiring, the routing, the retries, and the memory. That is pipeline orchestration doing the boring, durable work so your functions can stay simple.