Streaming Serverless Responses with Server-Sent Events (SSE)
Streaming is the single biggest lever on perceived latency. Here's how to stream responses from serverless functions with SSE — async generators, LLM token streaming, disconnect handling, and heartbeats — and when not to.
If you have used a modern AI product, you have felt the difference streaming makes. The answer does not appear all at once after an awkward pause — it types itself out, token by token, the moment the model starts producing it. That is not a cosmetic trick. It is the single biggest lever you have on perceived latency, and on serverless it used to be the hardest thing to get right.
This post is about streaming responses from serverless functions with Server-Sent Events (SSE): what SSE is, why it fits serverless better than WebSockets for this job, how to stream from a function with an async generator, how to survive client disconnects, and the patterns that matter for LLM token streaming and long-running job progress.
Why streaming matters more than raw speed
A request that takes four seconds to return a 2,000-token answer feels slow. The same request, streamed, feels fast — because the user sees the first words in a few hundred milliseconds and reads along as the rest arrives. Nothing about the total work changed; the experience changed completely.
Streaming pays off whenever the response is produced incrementally:
- LLM token streaming — the classic ChatGPT-style typewriter. The model emits tokens as it decodes; you forward each one the instant it exists.
- AI agent progress — “searching…”, “calling the pricing tool…”, “drafting…”. Multi-step agents can be silent for seconds; streaming turns that silence into visible progress.
- Long jobs with intermediate output — a report generator, a data export, a migration that emits a line per processed row.
- Search and retrieval — first results on screen while the long tail is still ranking.
In every case the goal is the same: shorten time-to-first-byte, not total time. Streaming is how you do that without lying to the user with a fake progress bar.
Server-Sent Events vs WebSockets
The two obvious ways to push data to a browser are WebSockets and Server-Sent Events. For streaming a response — one server talking, the client listening — SSE is almost always the better fit, and it is what a serverless gateway can expose cleanly.
Server-Sent Events is a dead-simple standard: the server responds with Content-Type: text/event-stream and writes text frames like data: {"delta":"Hel"}\n\n. The browser consumes them with the built-in EventSource API (or fetch + a stream reader). It is one-directional, runs over plain HTTP, and needs no handshake or special protocol. EventSource also reconnects automatically, but it only issues GET requests; a fetch-based reader can POST a body, at the cost of handling reconnection yourself.
WebSockets give you a full-duplex socket. That is powerful when the client and server chat back and forth continuously — multiplayer, collaborative editing, live cursors. But it is a persistent, stateful connection, which is exactly what serverless functions are not built to hold. For “generate a response and stream it out,” a WebSocket is a heavyweight answer to a one-directional question.
The rule of thumb: if the data flows one way (server → client), reach for SSE. It is simpler to build, simpler to operate, and it maps naturally onto a request/response gateway.
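To make the wire format concrete, here is a minimal sketch of a frame encoder. The helper name formatSseFrame is hypothetical and not part of any platform API; it only illustrates the text/event-stream framing rules (an optional event: line, one data: line per payload line, and a blank line to end the frame).

```javascript
// Hypothetical helper: encode one SSE frame. Illustration only.
function formatSseFrame(data, eventName) {
  const payload = typeof data === 'string' ? data : JSON.stringify(data);
  // Each line of a multi-line payload needs its own "data:" prefix.
  const dataLines = payload
    .split('\n')
    .map((line) => `data: ${line}`)
    .join('\n');
  const eventLine = eventName ? `event: ${eventName}\n` : '';
  // The trailing blank line ("\n\n") terminates the frame.
  return `${eventLine}${dataLines}\n\n`;
}

// JSON.stringify is used here only to make the newlines visible when printed.
console.log(JSON.stringify(formatSseFrame({ delta: 'Hel' })));
console.log(JSON.stringify(formatSseFrame({ tool: 'search' }, 'tool_start')));
```

Nothing here needs a socket library or a protocol upgrade: a frame is just a few lines of text on an ordinary HTTP response body.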
Streaming from a serverless function: return an async generator
On Inquir, a function streams by returning an async generator. Instead of computing a whole result and returning it, you yield pieces as they become available, and the gateway forwards each one to the client as an SSE event in real time.
// `llm` is a placeholder for your model client; it is not defined in this snippet.
export async function* handler(event, context) {
  const { prompt } = JSON.parse(event.body);

  // Stream tokens straight from the model as they decode.
  const completion = await llm.stream({ prompt });
  for await (const token of completion) {
    yield { delta: token }; // -> data: {"delta":"..."}\n\n on the wire
  }

  // Final event: signal completion and report usage.
  yield { done: true, usage: completion.usage };
}
Two things are happening here. First, the function is an async function* — an async generator — so the runtime knows to stream it rather than wait for a single return value. Second, each yield is flushed to the client immediately as a data: frame. A yielded object is serialized to JSON and sent as data: {...}\n\n; a yielded string is written to the wire as-is, so you can emit custom named SSE events (event: tool_start\ndata: {...}\n\n) when you need them.
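The serialization rule just described (objects become JSON data: frames, strings pass through untouched) can be sketched in a few lines. This is an illustration of the rule only, not the gateway's actual implementation; toSseFrames, demoHandler, and collectFrames are hypothetical names.

```javascript
// Illustrative only: mimics the object-vs-string rule described above.
async function* toSseFrames(source) {
  for await (const chunk of source) {
    if (typeof chunk === 'string') {
      yield chunk; // strings are forwarded as-is; the handler supplies the framing
    } else {
      yield `data: ${JSON.stringify(chunk)}\n\n`;
    }
  }
}

// Hypothetical handler mixing a hand-written named event with plain objects.
async function* demoHandler() {
  yield `event: tool_start\ndata: ${JSON.stringify({ tool: 'search' })}\n\n`;
  yield { delta: 'Hello' };
  yield { done: true };
}

async function collectFrames() {
  const frames = [];
  for await (const frame of toSseFrames(demoHandler())) frames.push(frame);
  return frames;
}

collectFrames().then((frames) => console.log(frames.join('')));
```

The practical consequence is that a handler yielding a raw string is responsible for its own trailing blank line; forget it and the client will keep waiting for the frame to end.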
The client side is just as small:
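const res = await fetch('/api/agent', { method: 'POST', body: JSON.stringify({ prompt }) });
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
for (;;) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true }); // keeps split multi-byte characters intact
  const frames = buffer.split('\n\n'); // a network chunk may hold several frames, or part of one
  buffer = frames.pop(); // the last piece may be incomplete; keep it for the next read
  for (const frame of frames) {
    if (frame.startsWith('data: ')) render(JSON.parse(frame.slice(6))); // e.g. append event.delta to the UI
  }
}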
No socket, no library, no protocol negotiation. A POST that happens to answer with a stream.
An event model for AI agents
Raw token deltas are enough for a plain completion, but an agent that calls tools benefits from a small vocabulary of named events so the UI can show what the agent is doing, not just what it is saying. A useful convention:
- start: the run began; echo the query and the available tools.
- delta: a content token; the real-time typewriter.
- tool_start: the agent invoked a tool (show a “running search…” chip).
- tool_result: the tool returned (collapse the chip, maybe show a citation).
- done: the final answer plus a usage/summary payload.
- error: something failed mid-stream; the client can show a partial answer and a retry.
Because you control what you yield, you emit exactly these events at the right moments. The UI becomes a small state machine driven by the stream, and the “thinking…” states stop being fake spinners and start being real.
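As a rough sketch of that state machine, the reducer below folds a sequence of events into UI state. The event names follow the convention above; the payload fields (type, text, tool, message) and the state shape are assumptions made for illustration.

```javascript
// Hypothetical UI state; field names are assumptions, not a platform contract.
const initialState = { status: 'idle', answer: '', activeTool: null, error: null };

function reduceAgentEvent(state, event) {
  switch (event.type) {
    case 'start':
      return { ...initialState, status: 'running' };
    case 'delta':
      return { ...state, answer: state.answer + event.text };
    case 'tool_start':
      return { ...state, activeTool: event.tool }; // show the "running..." chip
    case 'tool_result':
      return { ...state, activeTool: null }; // collapse the chip
    case 'done':
      return { ...state, status: 'done', activeTool: null };
    case 'error':
      // Keep the partial answer so the UI can show it next to a retry button.
      return { ...state, status: 'error', activeTool: null, error: event.message };
    default:
      return state; // ignore unknown events so the vocabulary can grow
  }
}

const sampleEvents = [
  { type: 'start' },
  { type: 'tool_start', tool: 'search' },
  { type: 'tool_result' },
  { type: 'delta', text: 'Hi ' },
  { type: 'delta', text: 'there' },
  { type: 'done' },
];
console.log(sampleEvents.reduce(reduceAgentEvent, initialState));
```

Keeping the reducer pure makes the stream easy to replay in tests: feed it a recorded list of events and assert on the final state.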
Surviving disconnects — the part everyone forgets
Here is the failure mode that bites teams in production: the user closes the tab halfway through a 30-second answer, but your function keeps running — still calling the model, still burning tokens, still holding the worker busy for a request nobody is listening to. Multiply that by real traffic and you are paying for a lot of work that reaches no one.
Inquir hands your streaming handler an abort signal on the context so you can react to a disconnect:
export async function* handler(event, context) {
  const { prompt } = JSON.parse(event.body);

  // Forward the abort signal so the upstream model call is cancelled too.
  const completion = await llm.stream({ prompt, signal: context.signal });
  for await (const token of completion) {
    if (context.signal.aborted) return; // client left: stop pulling tokens
    yield { delta: token };
  }
}
context.signal is a standard AbortSignal, the kind an AbortController produces. When the client disconnects, the runtime aborts it and stops draining your generator. That runs the generator's finally blocks, which is where you cancel the upstream model call and release resources. Passing context.signal straight into your LLM SDK (most accept an AbortSignal) means a closed tab actually cancels the expensive work instead of orphaning it.
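The interaction between the abort signal and finally blocks can be tried locally. The sketch below simulates a runtime that forwards two chunks and then sees the client leave; fakeModelStream stands in for an LLM SDK stream, and simulateDisconnect and cleanupLog are likewise hypothetical names, not platform APIs.

```javascript
// Fake token source standing in for an LLM SDK stream (hypothetical).
async function* fakeModelStream(tokens, signal) {
  for (const token of tokens) {
    if (signal.aborted) return; // stop producing once cancelled
    await new Promise((resolve) => setTimeout(resolve, 5));
    yield token;
  }
}

const cleanupLog = [];

async function* handler(event, context) {
  try {
    for await (const token of fakeModelStream(['a', 'b', 'c', 'd'], context.signal)) {
      if (context.signal.aborted) return;
      yield { delta: token };
    }
  } finally {
    // Runs on normal completion and on early termination alike.
    cleanupLog.push('cleanup');
  }
}

// Simulated runtime: forwards two chunks, then the "client" disconnects.
async function simulateDisconnect() {
  const controller = new AbortController();
  const stream = handler({}, { signal: controller.signal });
  const sent = [];
  for await (const chunk of stream) {
    sent.push(chunk.delta);
    if (sent.length === 2) {
      controller.abort();
      break; // leaving the loop calls stream.return(), which triggers finally
    }
  }
  return sent;
}

simulateDisconnect().then((sent) => console.log('forwarded:', sent));
```

Only the first two tokens are forwarded, and the finally block still runs, so cleanup code placed there does not depend on the stream finishing normally.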
Heartbeats, buffering, and proxies
Two operational details separate a demo from something that survives real networks:
- Heartbeats. A stream can legitimately go quiet for many seconds: an agent thinking, a slow tool call. Idle connections get killed by load balancers and proxies. The gateway sends a periodic SSE comment (: ping) to keep the connection warm through those intermediaries, so a thoughtful pause does not read as a dead socket.
- Flush, don’t buffer. SSE only feels real-time if each frame is flushed rather than buffered until the response ends. The gateway flushes headers immediately to put the client into streaming mode and yields to the event loop between chunks so the network layer can push bytes out as they are produced. If you have ever seen a “streaming” endpoint dump its entire output in one burst at the end, that was buffering, and it defeats the whole point.
You do not configure any of this; it is the default behavior of the streaming path. But knowing it is there explains why a quiet agent stays connected and why your tokens actually arrive one at a time.
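For intuition about what a heartbeat layer does, here is one possible sketch: wrap an async iterable and emit an SSE comment whenever the source stays silent longer than an interval. This is not the gateway's implementation; withHeartbeat, slowSource, and the interval values are assumptions chosen for the demo.

```javascript
const TIMEOUT = Symbol('timeout');

// Sketch: interleave ": ping" comments into a quiet stream.
async function* withHeartbeat(source, intervalMs) {
  const iterator = source[Symbol.asyncIterator]();
  let pending = iterator.next();
  try {
    for (;;) {
      let timer;
      const timeout = new Promise((resolve) => {
        timer = setTimeout(() => resolve(TIMEOUT), intervalMs);
      });
      const winner = await Promise.race([pending, timeout]);
      clearTimeout(timer);
      if (winner === TIMEOUT) {
        yield ': ping\n\n'; // SSE comment line; clients ignore it
        continue; // keep waiting on the same pending next()
      }
      if (winner.done) return;
      yield winner.value;
      pending = iterator.next();
    }
  } finally {
    // Let the source run its own cleanup if the consumer stops early.
    await iterator.return?.();
  }
}

// Hypothetical source with a long silent gap between two values.
async function* slowSource() {
  yield { delta: 'a' };
  await new Promise((resolve) => setTimeout(resolve, 100));
  yield { delta: 'b' };
}

(async () => {
  for await (const chunk of withHeartbeat(slowSource(), 10)) console.log(chunk);
})();
```

The exact number of pings depends on timer scheduling, so code consuming the stream should treat comment lines as noise rather than count them.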
Streaming across Node.js, Python, and Go
Streaming is a first-class capability in all three runtimes, with one shape difference worth knowing:
- Node.js 22 and Python 3.12 stream the way you would expect: an async generator that yields (Python: async def with yield).
- Go 1.22 streams by returning a channel that the runtime drains into the SSE response, which fits Go’s concurrency model better than a generator would.
One caveat to plan around: Go streaming requires the warm/hot execution path; the cold path rejects streaming. In practice you keep a warm pool for latency-sensitive streaming endpoints anyway, so this rarely bites — but it is the kind of per-runtime difference worth knowing before you pick a language for a streaming service rather than discovering it in production.
When not to stream
Streaming is not free complexity, and it is the wrong tool for plenty of endpoints:
- Small, instant responses. If the whole payload is 200 bytes and computed in 5ms, streaming just adds machinery. Return JSON.
- Work that outlasts the request. Streaming keeps a connection open while the work runs; it does not make the work durable. For a 20-minute export, do not stream for 20 minutes — accept the request, return fast, run it as a background job, and let the client poll or receive a webhook. Streaming is for output produced during a request the user is actively waiting on, not for offloading long work.
- Machine-to-machine calls that want one JSON blob. If the caller is another service that just wants the final result, a normal response is simpler for everyone.
Match the transport to the shape of the work: instant → JSON, produced-incrementally-while-waiting → SSE stream, outlasts-the-request → background job.
The takeaway
Streaming is the cheapest large win in AI UX, and on a serverless backend it comes down to a small set of moves: return an async generator, yield events as they happen, respect the abort signal so a closed tab cancels real work, and let the platform handle heartbeats and flushing. Do that and a four-second answer feels instant — first token in a few hundred milliseconds, the rest reading itself out while your worker quietly streams and your model call gets cancelled the moment nobody is listening.
If you are building chat, agents, or anything where the answer arrives a piece at a time, stream it. Your users will read the difference before they can measure it.