Serverless PDF processing: generate, extract, and transform
Generate PDFs from HTML templates or data, extract text and structured data from uploaded documents, and transform PDFs in background pipeline steps, with no HTTP timeouts, built-in retries, and storage-backed handling of binary responses.
Last updated: 2026-04-20
Answer first
Direct answer
Run PDF work as async pipeline steps. The HTTP handler accepts the generation or extraction request, returns 202 with a job ID, and triggers a pipeline step. The pipeline step handles the heavy PDF operation with its own memory budget, retries on failure, and stores the result to object storage.
When it fits
- Invoice and report generation that takes more than a second
- Document data extraction and OCR that runs in background pipelines
- Batch PDF generation (N invoices, N reports) with fan-out parallelism
Tradeoffs
- PDF libraries (pdf-lib, pdfkit, puppeteer for HTML-to-PDF) are memory-intensive; parallel generation in a single invocation risks memory exhaustion
- Without retries, a transient crash or OOM during generation loses the job and forces the user to retry manually
Workload and what breaks
Why PDF processing does not fit synchronous HTTP handlers
- PDF generation from complex templates takes 2–30 seconds—hits gateway timeouts
- Text extraction from uploaded PDFs with OCR can take minutes for image-heavy files
- Binary response handling (base64 encoding large PDFs) adds memory pressure to synchronous handlers
PDF operations span a wide latency range: a simple 1-page invoice might generate in 500ms, but a 50-page report with embedded charts, a 100-page contract requiring OCR extraction, or a batch of 200 invoices all exceed synchronous handler windows.
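The base64 memory pressure mentioned above is easy to quantify: base64 encodes every 3 raw bytes as 4 characters, so an inline PDF response body costs roughly a third more memory than the raw file. A minimal sketch:

```javascript
// Base64 expands binary data by 4/3 (rounded up to 4-byte groups), so a
// PDF returned inline as a base64 body costs ~33% more memory than the
// raw bytes, before any gateway payload limit applies.
function base64Size(rawBytes) {
  return Math.ceil(rawBytes / 3) * 4;
}

// A 6 MB report PDF becomes ~8 MB of response body.
const raw = 6 * 1024 * 1024; // 6291456 bytes
console.log(base64Size(raw)); // 8388608
```

This is why storage-backed URLs (covered below) beat inline base64 bodies for anything beyond trivially small files.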
Where shortcuts fail
Why inline PDF generation in HTTP handlers is brittle
PDF libraries (pdf-lib, pdfkit, puppeteer for HTML-to-PDF) are memory-intensive. Generating 10 PDFs in parallel in the same handler invocation can cause memory exhaustion on constrained serverless runtimes.
No retry on failure means a transient library crash or OOM during PDF generation loses the job entirely. The user gets an error and must retry manually.
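Even before moving to pipeline steps, the parallel-generation failure mode can be reduced by capping how many PDF jobs run at once. A sketch of a concurrency-limited mapper (the helper name and limit are illustrative, not part of any library):

```javascript
// Run `worker` over `items` with at most `limit` in flight at once,
// instead of Promise.all over every PDF job simultaneously.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const i = next++; // claim the next index synchronously
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

For example, `mapWithConcurrency(invoiceIds, 2, renderPdf)` keeps at most two renderers alive at a time. It caps memory but does not add retries or durability; that is what the pipeline-step model below provides.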
How Inquir helps
PDF operations as async pipeline steps
HTTP handler accepts the generation or extraction request, returns 202 with a job ID, and triggers a pipeline step. The pipeline step handles the heavy PDF operation with its own memory budget, retries on failure, and stores the result to object storage.
Binary PDFs are stored to object storage (S3-compatible) with a pre-signed URL returned to the client—not passed as a large base64 body through the gateway.
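In practice, S3-compatible stores issue pre-signed URLs through their SDKs (AWS Signature V4). As a minimal sketch of the underlying idea only, assuming a hypothetical storage host and a shared secret, a URL can carry an HMAC over the object key and expiry that the gateway verifies before serving the file:

```javascript
import { createHmac } from 'node:crypto';

// Sketch of the pre-signed URL idea: sign (objectKey + expiry) with a
// secret; the storage gateway recomputes the signature before serving.
// Real S3-compatible stores use AWS Signature V4 via their SDKs.
const SECRET = 'demo-secret'; // hypothetical; loaded from config in practice

function presign(objectKey, expiresAtMs) {
  const sig = createHmac('sha256', SECRET)
    .update(`${objectKey}:${expiresAtMs}`)
    .digest('hex');
  return `https://storage.example.com/${objectKey}?expires=${expiresAtMs}&sig=${sig}`;
}

function verify(objectKey, expiresAtMs, sig, nowMs) {
  if (nowMs > expiresAtMs) return false; // link expired
  const expected = createHmac('sha256', SECRET)
    .update(`${objectKey}:${expiresAtMs}`)
    .digest('hex');
  return sig === expected;
}
```

The client gets a time-limited link instead of a multi-megabyte base64 body, and the gateway never buffers the PDF.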
What you get
Serverless PDF processing patterns
Template-to-PDF generation
Render an HTML template with data, convert to PDF with puppeteer or a PDF library, store to object storage, return pre-signed URL.
PDF text extraction
Extract text and structured data from uploaded PDFs—invoices, contracts, forms. Return structured JSON for downstream processing.
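Once a library has produced raw text from the PDF, the structured-JSON step is ordinary parsing. A sketch, with illustrative field patterns (real invoices typically need per-vendor templates):

```javascript
// Turn raw extracted invoice text into structured JSON for downstream
// processing. The regexes are illustrative examples, not a general parser.
function parseInvoiceText(text) {
  const field = (re) => text.match(re)?.[1] ?? null;
  return {
    invoiceNumber: field(/Invoice\s*#?\s*:?\s*(\S+)/i),
    total: field(/Total\s*:?\s*\$?([\d,]+\.\d{2})/i),
    dueDate: field(/Due\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})/i),
  };
}
```

Missing fields come back as `null`, so downstream steps can route incomplete documents to a review queue rather than failing the whole job.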
Batch PDF generation
Generate N PDFs in parallel pipeline steps (fan-out), then merge or zip in a final step (fan-in). Handle 200 invoices in minutes, not hours.
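The fan-out step starts by splitting the N jobs into fixed-size batches so each pipeline step stays within its own memory budget. A minimal sketch (batch size is a tuning choice, not a platform requirement):

```javascript
// Split N jobs into fixed-size batches: one pipeline step per batch
// (fan-out), then a final merge/zip step once all batches finish (fan-in).
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 200 invoices in batches of 25 → 8 parallel pipeline steps.
const batches = chunk(Array.from({ length: 200 }, (_, i) => `inv-${i}`), 25);
console.log(batches.length); // 8
```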
OCR on image PDFs
Pipeline step calls an OCR service or a local library (e.g., pytesseract in Python) for scanned-document text extraction; long-running and retried on failure.
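Pipeline steps get retry-on-failure from the platform; as a sketch of the semantics only, a retry wrapper with exponential backoff around a flaky call (the helper and its options are hypothetical, not a platform API):

```javascript
// Retry a flaky async operation (e.g., an OCR service call) with
// exponential backoff. This sketches what the platform does for a
// failing pipeline step; it is not the platform's actual API.
async function withRetry(fn, { attempts = 3, baseDelayMs = 100 } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // 100ms, 200ms, 400ms, ... between attempts
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastErr;
}
```

A transient OCR timeout then costs one delayed retry instead of losing the whole document job.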
What to do next
Async PDF generation flow
HTTP handler returns 202 + job ID
Validate request, store job record with status=pending, trigger pipeline with data and job ID.
Pipeline step generates PDF
Render template with data, generate PDF, upload to object storage, update job record with file URL.
Notify client
Return pre-signed download URL via webhook, email, or status endpoint poll.
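For the status-endpoint option, the client polls until the job record reports a URL. A sketch with an injected `fetchStatus` function so the transport stays swappable; the job-record shape (`status`, `url`) mirrors the flow above, and the endpoint path is a hypothetical example:

```javascript
// Poll the job-status endpoint until the PDF is ready, then return the
// pre-signed download URL. `fetchStatus` might wrap GET /jobs/:id.
async function waitForPdf(jobId, fetchStatus, { intervalMs = 1000, maxPolls = 30 } = {}) {
  for (let i = 0; i < maxPolls; i++) {
    const job = await fetchStatus(jobId);
    if (job.status === 'done') return job.url; // pre-signed download URL
    if (job.status === 'failed') throw new Error(`PDF job ${jobId} failed`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`PDF job ${jobId} still pending after ${maxPolls} polls`);
}
```

Webhooks avoid the polling loop entirely; this shape is for clients that can only call a status endpoint.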
Code example
Invoice PDF generation pipeline
HTTP handler triggers async generation; pipeline step renders HTML to PDF, uploads, and notifies. Client polls job status.
// HTTP handler: validate, dedupe, enqueue, return 202
export async function handler(event) {
  const { invoiceId, customerId } = JSON.parse(event.body || '{}');
  if (!invoiceId) {
    return { statusCode: 400, body: JSON.stringify({ error: 'invoiceId required' }) };
  }

  // Idempotency: if a PDF already exists for this invoice, return it directly
  const existing = await db.invoicePdfs.find(invoiceId);
  if (existing?.url) {
    return { statusCode: 200, body: JSON.stringify({ url: existing.url }) };
  }

  // Kick off the pipeline step and respond immediately
  await global.durable.startNew('render-invoice-pdf', undefined, { invoiceId, customerId });
  return { statusCode: 202, body: JSON.stringify({ invoiceId, status: 'generating' }) };
}
// Pipeline step: render HTML to PDF, upload, record the result
import puppeteer from 'puppeteer-core';

export async function handler(event) {
  const { invoiceId, customerId } = event.payload ?? {};
  const invoice = await db.invoices.findById(invoiceId);
  const html = renderInvoiceTemplate(invoice);

  const browser = await puppeteer.launch({ executablePath: '/usr/bin/chromium' });
  try {
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: 'networkidle0' });
    const pdfBytes = await page.pdf({ format: 'A4', printBackground: true });

    const url = await storage.upload(pdfBytes, `invoices/${invoiceId}.pdf`);
    await db.invoicePdfs.upsert({ invoiceId, url, generatedAt: new Date() });
    return { invoiceId, url };
  } finally {
    // Always release the browser, even if rendering or upload fails
    await browser.close();
  }
}
When it fits
When this works
- Invoice and report generation that takes more than a second
- Document data extraction and OCR that runs in background pipelines
- Batch PDF generation (N invoices, N reports) with fan-out parallelism
When to skip it
- Simple one-page PDFs that consistently generate in under 1 second—keep those synchronous for simpler flow
FAQ
How do I handle large PDF binary responses?
Store the PDF to object storage (S3-compatible) and return a pre-signed download URL. Avoid passing large base64-encoded PDFs as HTTP response bodies—gateway limits and client memory both benefit from storage-backed URLs.
Can I use Python libraries like reportlab or pdfplumber?
Yes—Python 3.12 supports reportlab, pdfplumber, PyMuPDF, and other PDF libraries. Deploy a Python function for extraction and transformation work alongside Node.js functions for HTTP handlers in the same workspace.