Serverless PDF processing: generate, extract, and transform
Generate PDFs from HTML templates or data, extract text and structured data from uploaded documents, and transform PDFs in background pipeline steps, with no HTTP timeouts, built-in retries, and storage-backed handling of binary responses.
Last updated: 2026-04-20
Answer first
Direct answer
Run PDF work as async pipeline steps. The HTTP handler accepts the generation or extraction request, returns 202 with a job ID, and triggers a pipeline step. The pipeline step handles the heavy PDF operation with its own memory budget, retries on failure, and stores the result to object storage.
When it fits
- Invoice and report generation that takes more than a second
- Document data extraction and OCR that runs in background pipelines
- Batch PDF generation (N invoices, N reports) with fan-out parallelism
Tradeoffs
- PDF libraries (pdf-lib, pdfkit, puppeteer for HTML-to-PDF) are memory-intensive; parallel generation in a single invocation risks memory exhaustion
- Without retries, a transient crash or OOM during generation loses the job and forces the user to retry manually
Workload and what breaks
Why PDF processing does not fit synchronous HTTP handlers
- PDF generation from complex templates takes 2–30 seconds—hits gateway timeouts
- Text extraction from uploaded PDFs with OCR can take minutes for image-heavy files
- Binary response handling (base64 encoding large PDFs) adds memory pressure to synchronous handlers
PDF operations span a wide latency range: a simple 1-page invoice might generate in 500ms, but a 50-page report with embedded charts, a 100-page contract requiring OCR extraction, or a batch of 200 invoices all exceed synchronous handler windows.
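The base64 memory pressure mentioned above is easy to quantify: base64 encodes every 3 raw bytes as 4 characters, so an inline PDF response body costs roughly a third more memory than the raw file. A minimal sketch:

```javascript
// Base64 expands binary data by 4/3 (rounded up to 4-byte groups), so a
// PDF returned inline as a base64 body costs ~33% more memory than the
// raw bytes, before any gateway payload limit applies.
function base64Size(rawBytes) {
  return Math.ceil(rawBytes / 3) * 4;
}

// A 6 MB report PDF becomes ~8 MB of response body.
const raw = 6 * 1024 * 1024; // 6291456 bytes
console.log(base64Size(raw)); // 8388608
```

This is why storage-backed URLs (covered below) beat inline base64 bodies for anything beyond trivially small files.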
Where shortcuts fail
Why inline PDF generation in HTTP handlers is brittle
PDF libraries (pdf-lib, pdfkit, puppeteer for HTML-to-PDF) are memory-intensive. Generating 10 PDFs in parallel in the same handler invocation can cause memory exhaustion on constrained serverless runtimes.
No retry on failure means a transient library crash or OOM during PDF generation loses the job entirely. The user gets an error and must retry manually.
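Even before moving to pipeline steps, the parallel-generation failure mode can be reduced by capping how many PDF jobs run at once. A sketch of a concurrency-limited mapper (the helper name and limit are illustrative, not part of any library):

```javascript
// Run `worker` over `items` with at most `limit` in flight at once,
// instead of Promise.all over every PDF job simultaneously.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const i = next++; // claim the next index synchronously
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

For example, `mapWithConcurrency(invoiceIds, 2, renderPdf)` keeps at most two renderers alive at a time. It caps memory but does not add retries or durability; that is what the pipeline-step model below provides.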
How Inquir helps
PDF operations as async pipeline steps
HTTP handler accepts the generation or extraction request, returns 202 with a job ID, and triggers a pipeline step. The pipeline step handles the heavy PDF operation with its own memory budget, retries on failure, and stores the result to object storage.
Binary PDFs are stored to object storage (S3-compatible) with a pre-signed URL returned to the client—not passed as a large base64 body through the gateway.
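In practice, S3-compatible stores issue pre-signed URLs through their SDKs (AWS Signature V4). As a minimal sketch of the underlying idea only, assuming a hypothetical storage host and a shared secret, a URL can carry an HMAC over the object key and expiry that the gateway verifies before serving the file:

```javascript
import { createHmac } from 'node:crypto';

// Sketch of the pre-signed URL idea: sign (objectKey + expiry) with a
// secret; the storage gateway recomputes the signature before serving.
// Real S3-compatible stores use AWS Signature V4 via their SDKs.
const SECRET = 'demo-secret'; // hypothetical; loaded from config in practice

function presign(objectKey, expiresAtMs) {
  const sig = createHmac('sha256', SECRET)
    .update(`${objectKey}:${expiresAtMs}`)
    .digest('hex');
  return `https://storage.example.com/${objectKey}?expires=${expiresAtMs}&sig=${sig}`;
}

function verify(objectKey, expiresAtMs, sig, nowMs) {
  if (nowMs > expiresAtMs) return false; // link expired
  const expected = createHmac('sha256', SECRET)
    .update(`${objectKey}:${expiresAtMs}`)
    .digest('hex');
  return sig === expected;
}
```

The client gets a time-limited link instead of a multi-megabyte base64 body, and the gateway never buffers the PDF.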
What you get
Serverless PDF processing patterns
Template-to-PDF generation
Render an HTML template with data, convert to PDF with puppeteer or a PDF library, store to object storage, return pre-signed URL.
PDF text extraction
Extract text and structured data from uploaded PDFs—invoices, contracts, forms. Return structured JSON for downstream processing.
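Once a library has produced raw text from the PDF, the structured-JSON step is ordinary parsing. A sketch, with illustrative field patterns (real invoices typically need per-vendor templates):

```javascript
// Turn raw extracted invoice text into structured JSON for downstream
// processing. The regexes are illustrative examples, not a general parser.
function parseInvoiceText(text) {
  const field = (re) => text.match(re)?.[1] ?? null;
  return {
    invoiceNumber: field(/Invoice\s*#?\s*:?\s*(\S+)/i),
    total: field(/Total\s*:?\s*\$?([\d,]+\.\d{2})/i),
    dueDate: field(/Due\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})/i),
  };
}
```

Missing fields come back as `null`, so downstream steps can route incomplete documents to a review queue rather than failing the whole job.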
Batch PDF generation
Generate N PDFs in parallel pipeline steps (fan-out), then merge or zip in a final step (fan-in). Handle 200 invoices in minutes, not hours.
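The fan-out step starts by splitting the N jobs into fixed-size batches so each pipeline step stays within its own memory budget. A minimal sketch (batch size is a tuning choice, not a platform requirement):

```javascript
// Split N jobs into fixed-size batches: one pipeline step per batch
// (fan-out), then a final merge/zip step once all batches finish (fan-in).
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 200 invoices in batches of 25 → 8 parallel pipeline steps.
const batches = chunk(Array.from({ length: 200 }, (_, i) => `inv-${i}`), 25);
console.log(batches.length); // 8
```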
OCR on image PDFs
Pipeline step calls an OCR service or a local library (e.g., pytesseract in Python) for scanned-document text extraction; long-running and retried on failure.
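Pipeline steps get retry-on-failure from the platform; as a sketch of the semantics only, a retry wrapper with exponential backoff around a flaky call (the helper and its options are hypothetical, not a platform API):

```javascript
// Retry a flaky async operation (e.g., an OCR service call) with
// exponential backoff. This sketches what the platform does for a
// failing pipeline step; it is not the platform's actual API.
async function withRetry(fn, { attempts = 3, baseDelayMs = 100 } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // 100ms, 200ms, 400ms, ... between attempts
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastErr;
}
```

A transient OCR timeout then costs one delayed retry instead of losing the whole document job.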
What to do next
Async PDF generation flow
HTTP handler returns 202 + job ID
Validate request, store job record with status=pending, trigger pipeline with data and job ID.
Pipeline step generates PDF
Render template with data, generate PDF, upload to object storage, update job record with file URL.
Notify client
Return pre-signed download URL via webhook, email, or status endpoint poll.
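For the status-endpoint option, the client polls until the job record reports a URL. A sketch with an injected `fetchStatus` function so the transport stays swappable; the job-record shape (`status`, `url`) mirrors the flow above, and the endpoint path is a hypothetical example:

```javascript
// Poll the job-status endpoint until the PDF is ready, then return the
// pre-signed download URL. `fetchStatus` might wrap GET /jobs/:id.
async function waitForPdf(jobId, fetchStatus, { intervalMs = 1000, maxPolls = 30 } = {}) {
  for (let i = 0; i < maxPolls; i++) {
    const job = await fetchStatus(jobId);
    if (job.status === 'done') return job.url; // pre-signed download URL
    if (job.status === 'failed') throw new Error(`PDF job ${jobId} failed`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`PDF job ${jobId} still pending after ${maxPolls} polls`);
}
```

Webhooks avoid the polling loop entirely; this shape is for clients that can only call a status endpoint.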
Code example
Invoice PDF generation pipeline
HTTP handler triggers async generation; pipeline step renders HTML to PDF, uploads, and notifies. Client polls job status.
// HTTP handler: validate, dedupe, enqueue, return 202
export async function handler(event) {
  const { invoiceId, customerId } = JSON.parse(event.body || '{}');
  if (!invoiceId) {
    return { statusCode: 400, body: JSON.stringify({ error: 'invoiceId required' }) };
  }

  // Idempotency: if a PDF already exists for this invoice, return it directly
  const existing = await db.invoicePdfs.find(invoiceId);
  if (existing?.url) {
    return { statusCode: 200, body: JSON.stringify({ url: existing.url }) };
  }

  // Kick off the pipeline step and respond immediately
  await global.durable.startNew('render-invoice-pdf', undefined, { invoiceId, customerId });
  return { statusCode: 202, body: JSON.stringify({ invoiceId, status: 'generating' }) };
}
// Pipeline step: render HTML to PDF, upload, record the result
import puppeteer from 'puppeteer-core';

export async function handler(event) {
  const { invoiceId, customerId } = event.payload ?? {};
  const invoice = await db.invoices.findById(invoiceId);
  const html = renderInvoiceTemplate(invoice);

  const browser = await puppeteer.launch({ executablePath: '/usr/bin/chromium' });
  try {
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: 'networkidle0' });
    const pdfBytes = await page.pdf({ format: 'A4', printBackground: true });

    const url = await storage.upload(pdfBytes, `invoices/${invoiceId}.pdf`);
    await db.invoicePdfs.upsert({ invoiceId, url, generatedAt: new Date() });
    return { invoiceId, url };
  } finally {
    // Always release the browser, even if rendering or upload fails
    await browser.close();
  }
}
When it fits
When this works
- Invoice and report generation that takes more than a second
- Document data extraction and OCR that runs in background pipelines
- Batch PDF generation (N invoices, N reports) with fan-out parallelism
When to skip it
- Simple one-page PDFs that consistently generate in under 1 second—keep those synchronous for simpler flow
FAQ
How do I handle large PDF binary responses?
Store the PDF to object storage (S3-compatible) and return a pre-signed download URL. Avoid passing large base64-encoded PDFs as HTTP response bodies—gateway limits and client memory both benefit from storage-backed URLs.
Can I use Python libraries like reportlab or pdfplumber?
Yes—Python 3.12 supports reportlab, pdfplumber, PyMuPDF, and other PDF libraries. Deploy a Python function for extraction and transformation work alongside Node.js functions for HTTP handlers in the same workspace.