Inquir Compute

Serverless PDF processing: generate, extract, and transform

Generate PDFs from HTML templates or data, extract text and structured data from uploaded documents, and transform PDFs in background pipeline steps—with no HTTP timeout, binary response support, and retries built in.

Last updated: 2026-04-20

Direct answer

Run PDF work asynchronously: the HTTP handler accepts the generation or extraction request, returns 202 with a job ID, and triggers a pipeline step. The pipeline step performs the heavy PDF operation with its own memory budget, retries on failure, and stores the result to object storage.

When it fits

  • Invoice and report generation that takes more than a second
  • Document data extraction and OCR that runs in background pipelines
  • Batch PDF generation (N invoices, N reports) with fan-out parallelism

Tradeoffs

  • PDF libraries (pdf-lib, pdfkit, puppeteer for HTML-to-PDF) are memory-intensive; generating many PDFs in one handler invocation risks memory exhaustion on constrained serverless runtimes
  • Without built-in retries, a transient crash or OOM loses the job entirely and the user must retry manually

Why PDF processing does not fit synchronous HTTP handlers

  • PDF generation from complex templates takes 2–30 seconds—hits gateway timeouts
  • Text extraction from uploaded PDFs with OCR can take minutes for image-heavy files
  • Binary response handling (base64 encoding large PDFs) adds memory pressure to synchronous handlers

PDF operations span a wide latency range: a simple 1-page invoice might generate in 500ms, but a 50-page report with embedded charts, a 100-page contract requiring OCR extraction, or a batch of 200 invoices all exceed synchronous handler windows.

Why inline PDF generation in HTTP handlers is brittle

PDF libraries (pdf-lib, pdfkit, puppeteer for HTML-to-PDF) are memory-intensive. Generating 10 PDFs in parallel in the same handler invocation can cause memory exhaustion on constrained serverless runtimes.

No retry on failure means a transient library crash or OOM during PDF generation loses the job entirely. The user gets an error and must retry manually.

PDF operations as async pipeline steps

HTTP handler accepts the generation or extraction request, returns 202 with a job ID, and triggers a pipeline step. The pipeline step handles the heavy PDF operation with its own memory budget, retries on failure, and stores the result to object storage.

Binary PDFs are stored to object storage (S3-compatible) with a pre-signed URL returned to the client—not passed as a large base64 body through the gateway.

Serverless PDF processing patterns

Template-to-PDF generation

Render an HTML template with data, convert to PDF with puppeteer or a PDF library, store to object storage, return pre-signed URL.
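As a minimal sketch of the template side (the invoice shape and field names here are illustrative, not part of the platform API), the template can be a plain function from data to HTML:

```javascript
// Hypothetical invoice shape: { id, items: [{ description, qty, cents }] }.
// Adapt the fields to your own data model.
function renderInvoiceTemplate(invoice) {
  const rows = invoice.items
    .map((i) => `<tr><td>${i.description}</td><td>${i.qty}</td><td>${(i.cents / 100).toFixed(2)}</td></tr>`)
    .join('');
  const totalCents = invoice.items.reduce((sum, i) => sum + i.qty * i.cents, 0);
  return `<!doctype html>
<html><body>
  <h1>Invoice ${invoice.id}</h1>
  <table>${rows}</table>
  <p>Total: ${(totalCents / 100).toFixed(2)}</p>
</body></html>`;
}
```

Keeping the template a pure function of the data also makes it trivial to unit-test before it ever reaches the PDF renderer.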

PDF text extraction

Extract text and structured data from uploaded PDFs—invoices, contracts, forms. Return structured JSON for downstream processing.
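Once an extraction library has produced the raw text layer, turning it into structured JSON is ordinary string work. A hedged sketch, assuming the text was already extracted upstream (the field names and regexes are examples, not a general parser):

```javascript
// Pull a few common invoice fields out of raw extracted text.
// These patterns are illustrative; real documents need per-vendor tuning.
function parseInvoiceText(text) {
  const match = (re) => text.match(re)?.[1] ?? null;
  return {
    invoiceNumber: match(/Invoice\s*#?\s*([A-Z0-9-]+)/i),
    total: match(/Total[:\s]*\$?([\d,]+\.\d{2})/i),
    dueDate: match(/Due\s*Date[:\s]*(\d{4}-\d{2}-\d{2})/i),
  };
}
```

Returning `null` for fields that fail to match lets downstream steps flag documents for manual review instead of failing the whole batch.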

Batch PDF generation

Generate N PDFs in parallel pipeline steps (fan-out), then merge or zip in a final step (fan-in). Handle 200 invoices in minutes, not hours.
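A sketch of the fan-out side. The trigger is injected so the chunking logic stays testable; the call shape mirrors the `global.durable.startNew` call used in the invoice example on this page, and the batch size of 25 is an illustrative default, not a platform limit:

```javascript
// Split a batch into fixed-size chunks; each chunk becomes one pipeline step.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Fan-out: trigger one pipeline step per chunk, in parallel.
async function fanOutInvoices(invoiceIds, startNew, size = 25) {
  const batches = chunk(invoiceIds, size);
  await Promise.all(
    batches.map((ids, n) => startNew('render-invoice-batch', undefined, { batch: n, ids }))
  );
  return batches.length;
}
```

Chunking keeps each step's memory footprint bounded; the fan-in step then merges or zips the per-chunk outputs.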

OCR on image PDFs

Pipeline step calls an OCR service or local library (Python tesseract) for scanned document text extraction—long-running, retried on failure.
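The platform retries failed steps, but flaky calls to an external OCR service can also be guarded inside the step. A minimal retry-with-backoff sketch (the attempt count and delays are illustrative defaults, not platform settings):

```javascript
// Retry an async operation with exponential backoff before giving up.
async function withRetry(fn, attempts = 3, baseMs = 500) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Back off 500ms, 1s, 2s, ... between attempts.
      if (i < attempts - 1) await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
    }
  }
  throw lastErr;
}
```

In-step retries handle transient service hiccups cheaply; letting the final error propagate still hands persistent failures to the pipeline's own retry.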

Async PDF generation flow

1. HTTP handler returns 202 + job ID. Validate the request, store a job record with status=pending, and trigger the pipeline with the data and job ID.

2. Pipeline step generates the PDF. Render the template with data, generate the PDF, upload to object storage, and update the job record with the file URL.

3. Notify the client. Return the pre-signed download URL via webhook, email, or a status endpoint poll.
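The job record is what ties these steps together. An in-memory sketch of the status transitions (a real deployment would back this with the platform's database; the names here are illustrative):

```javascript
// Minimal job store: pending -> complete, matching the flow above.
const jobs = new Map();

function createJob(jobId) {
  jobs.set(jobId, { status: 'pending', url: null });
}

function completeJob(jobId, url) {
  const job = jobs.get(jobId);
  if (!job) throw new Error(`unknown job ${jobId}`);
  jobs.set(jobId, { ...job, status: 'complete', url });
}

// What a GET status endpoint would return to a polling client.
function jobStatus(jobId) {
  return jobs.get(jobId) ?? { status: 'not_found' };
}
```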

Invoice PDF generation pipeline

HTTP handler triggers async generation; pipeline step renders HTML to PDF, uploads, and notifies. Client polls job status.

api/generate-invoice.mjs (HTTP handler)
export async function handler(event) {
  const { invoiceId, customerId } = JSON.parse(event.body || '{}');
  if (!invoiceId) {
    return { statusCode: 400, body: JSON.stringify({ error: 'invoiceId required' }) };
  }

  // Idempotency: if this invoice's PDF already exists, skip regeneration.
  const existing = await db.invoicePdfs.find(invoiceId);
  if (existing?.url) {
    return { statusCode: 200, body: JSON.stringify({ url: existing.url }) };
  }

  // Kick off the pipeline step and respond immediately with 202 + job ID.
  await global.durable.startNew('render-invoice-pdf', undefined, { invoiceId, customerId });
  return { statusCode: 202, body: JSON.stringify({ invoiceId, status: 'generating' }) };
}
jobs/render-invoice-pdf.mjs (pipeline step)
import puppeteer from 'puppeteer-core';

export async function handler(event) {
  const { invoiceId, customerId } = event.payload ?? {};
  const invoice = await db.invoices.findById(invoiceId);
  if (!invoice) throw new Error(`invoice ${invoiceId} not found`); // fail fast so the step is retried or flagged

  const html = renderInvoiceTemplate(invoice);
  const browser = await puppeteer.launch({ executablePath: '/usr/bin/chromium' });
  try {
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: 'networkidle0' });
    const pdfBytes = await page.pdf({ format: 'A4', printBackground: true });

    const url = await storage.upload(pdfBytes, `invoices/${invoiceId}.pdf`);
    await db.invoicePdfs.upsert({ invoiceId, url, generatedAt: new Date() });
    return { invoiceId, url };
  } finally {
    // Always close Chromium, even when generation fails, so retries start clean.
    await browser.close();
  }
}
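On the client side, polling until the pre-signed URL appears can look like the sketch below. The endpoint path and response shape are assumptions based on the flow above, and `fetchJson` is injected to keep the sketch transport-agnostic:

```javascript
// Poll a job-status endpoint until the PDF URL is ready or attempts run out.
async function waitForPdf(invoiceId, fetchJson, { attempts = 30, intervalMs = 2000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    const job = await fetchJson(`/api/invoice-status/${invoiceId}`);
    if (job.status === 'complete' && job.url) return job.url;
    if (job.status === 'failed') throw new Error(`generation failed for ${invoiceId}`);
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`timed out waiting for ${invoiceId}`);
}
```

For browser clients that should not poll, the same result can arrive via the webhook or email path from step 3.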

Use serverless PDF processing for

When this works

  • Invoice and report generation that takes more than a second
  • Document data extraction and OCR that runs in background pipelines
  • Batch PDF generation (N invoices, N reports) with fan-out parallelism

When to skip it

  • Simple one-page PDFs that consistently generate in under 1 second—keep those synchronous for simpler flow

FAQ

How do I handle large PDF binary responses?

Store the PDF to object storage (S3-compatible) and return a pre-signed download URL. Avoid passing large base64-encoded PDFs as HTTP response bodies—gateway limits and client memory both benefit from storage-backed URLs.
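Base64 inflates payloads by roughly a third, which is part of why storage-backed URLs win. A quick check of the overhead:

```javascript
// Base64 encodes every 3 raw bytes as 4 characters: ~33% size overhead,
// paid against gateway payload limits and client memory.
function base64Overhead(rawBytes) {
  const encodedLength = Buffer.from(rawBytes).toString('base64').length;
  return encodedLength / rawBytes.length;
}
```

A 6 MB PDF becomes roughly 8 MB of base64 before the client has even decoded it back to bytes.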

Can I use Python libraries like reportlab or pdfplumber?

Yes—Python 3.12 supports reportlab, pdfplumber, PyMuPDF, and other PDF libraries. Deploy a Python function for extraction and transformation work alongside Node.js functions for HTTP handlers in the same workspace.

Inquir Compute

The simplest way to run AI agents and backend jobs without infrastructure.

Contact info@inquir.org

© 2025 Inquir Compute. All rights reserved.