Serverless CSV processing: parse, transform, and load large files
Accept a CSV upload via HTTP, return 202 immediately, and process the file in a background pipeline. Parse millions of rows, validate and transform data, upsert in batches with idempotency keys, and notify users when the import completes—all outside the HTTP timeout window.
Last updated: 2026-04-20
Answer first
Direct answer
Move the work out of the request. The HTTP handler validates the file reference (URL or storage key), stores job metadata, and returns 202 with a job ID. A pipeline step then reads the CSV in chunks, upserts idempotently by row ID, and records progress for resumability.
When it fits
- Customer data imports, product catalog uploads, bulk user migrations
- Regular CSV exports from external systems that need nightly or triggered processing
Tradeoffs
- Even with streaming parsers, inline processing keeps the connection open while rows are upserted; a database slow-down mid-import means gateway timeouts, client retries, and duplicate rows unless writes are idempotent.
- A "streaming" parse that buffers validated rows for one bulk insert still loads the entire file into RAM before writing.
Workload and what breaks
Why CSV processing breaks synchronous HTTP handlers
- Files with 100k+ rows take 30–300 seconds to parse and insert—well past gateway timeouts
- Inline processing holds the HTTP connection open—clients time out or retry, causing double imports
- Memory pressure: reading a 50 MB CSV into memory in one pass can OOM small serverless runtimes
CSV import is one of the most common patterns that does not fit synchronous HTTP handlers. The file size is unpredictable, the parse and validation time grows linearly with rows, and any failure mid-import without idempotency creates partial data that is hard to recover from.
Where shortcuts fail
Why streaming CSV in the HTTP handler is fragile
Even with streaming parsers, the HTTP handler must stay open while rows are being upserted. A database slow-down mid-import causes a timeout at the gateway, the client retries, and you get duplicate rows unless you built idempotency from the start.
Memory management is harder than it looks: a "streaming" CSV parse that collects validated rows in memory for a bulk insert still loads the entire file into RAM before writing.
How Inquir helps
HTTP accepts, pipeline processes in chunks
The HTTP handler validates the file reference (URL or storage key), stores job metadata, and returns 202 with a job ID. The pipeline step reads the CSV in chunks, upserts idempotently by row ID, and records progress for resumability.
Long CSVs can be processed in multiple pipeline steps—split by row range, fan out in parallel, fan in to a summary step. Each step has its own timeout budget; failure in one step retries that step without restarting from row 1.
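A minimal sketch of that fan-out, assuming a hypothetical pipeline.startStep trigger and a countRows helper rather than any specific Inquir API; the row-range split is the part that matters.

// Coordinator step: split the file into row ranges and fan out one step per range.
// pipeline.startStep and countRows are placeholders for your orchestration client
// and a helper that streams the file once to count data rows.
const RANGE_SIZE = 10000;

export async function planImport(event) {
  const { fileUrl, importId } = event.payload ?? {};
  const totalRows = await countRows(fileUrl);

  const ranges = [];
  for (let start = 0; start < totalRows; start += RANGE_SIZE) {
    ranges.push({ start, end: Math.min(start + RANGE_SIZE, totalRows) });
  }

  // Fan out: each range gets its own step and its own timeout budget
  await Promise.all(
    ranges.map((range) => pipeline.startStep('process-range', { fileUrl, importId, ...range }))
  );

  // Fan in: enqueue a summary step; how completion of all ranges is awaited
  // depends on your orchestrator's semantics
  await pipeline.startStep('summarize-import', { importId, expectedRanges: ranges.length });
}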
What you get
Serverless CSV processing patterns
Chunked processing with resumability
Split large CSVs into row-range pipeline steps. A failure at row 80k resumes from that checkpoint, not from row 1.
Parallel batch upsert
Fan out multiple pipeline steps to process row ranges in parallel—reduce total import time for large files.
Idempotent row upsert
Use a stable row identifier (external ID or row hash) as the upsert key. Re-running the import on the same file produces the same database state.
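One way to derive that stable key when rows have no external ID, sketched with Node's crypto module and a Postgres-style upsert; the records table and its columns are illustrative, not a required schema.

import { createHash } from 'node:crypto';

// Deterministic key from the row's content, used when no external_id is present
function rowKey(row) {
  return createHash('sha256').update(JSON.stringify(row)).digest('hex');
}

// ON CONFLICT makes re-running the same file converge on the same state
async function upsertRow(client, importId, row) {
  await client.query(
    `INSERT INTO records (row_key, import_id, payload)
     VALUES ($1, $2, $3)
     ON CONFLICT (row_key) DO UPDATE SET payload = EXCLUDED.payload`,
    [row.external_id ?? rowKey(row), importId, JSON.stringify(row)]
  );
}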
Progress tracking and notification
HTTP handler returns a job ID. Client polls a status endpoint; final pipeline step notifies via webhook or email when complete.
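The polling side can stay small; a sketch of a status endpoint assuming the same placeholder db client and a route that carries the job ID as a path parameter.

// GET /imports/{importId}: polled by the client while the pipeline runs
export async function statusHandler(event) {
  const importId = event.pathParameters?.importId;
  const job = await db.imports.get(importId);
  if (!job) {
    return { statusCode: 404, body: JSON.stringify({ error: 'unknown import' }) };
  }

  return {
    statusCode: 200,
    body: JSON.stringify({
      importId,
      status: job.status,                 // pending | processing | done | failed
      processedRows: job.processedRows ?? 0,
      rowCount: job.rowCount ?? null,
    }),
  };
}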
What to do next
Serverless CSV import flow
HTTP handler accepts file reference, returns 202
Validate the file URL or storage key. Store job record with status=pending. Trigger pipeline with file reference and job ID.
Pipeline step reads and processes CSV
Download and parse the CSV in the pipeline step. Upsert rows in batches of 500–1000 with idempotency keys. Update progress on job record.
Final step notifies
After last batch, mark job complete and notify the user via email, webhook, or status update.
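A sketch of that final step, assuming a webhookUrl stored on the job record; swap the fetch call for an email provider or any other channel.

// Final pipeline step: mark the job complete, then notify the caller
export async function notifyHandler(event) {
  const { importId } = event.payload ?? {};
  const job = await db.imports.get(importId);

  await db.imports.update(importId, { status: 'done', completedAt: new Date().toISOString() });

  if (job?.webhookUrl) {
    // Best-effort webhook; the status endpoint stays the source of truth
    await fetch(job.webhookUrl, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({ importId, status: 'done', rowCount: job.rowCount }),
    });
  }
}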
Code example
Chunked CSV import pipeline
HTTP handler returns 202; pipeline step processes the file in batches, and the idempotency key prevents duplicate rows on retry. The db client and the global.durable trigger are placeholders for your own storage layer and orchestration client.
// HTTP handler: validate the request, record the job, trigger the pipeline, return 202
export async function handler(event) {
  const { fileUrl, importId } = JSON.parse(event.body || '{}');
  if (!fileUrl || !importId) {
    return { statusCode: 400, body: JSON.stringify({ error: 'fileUrl and importId required' }) };
  }

  // Record the job before triggering the pipeline so status polling works immediately
  await db.imports.create({ id: importId, status: 'pending', fileUrl });
  await global.durable.startNew('process-csv', undefined, { fileUrl, importId });

  return { statusCode: 202, body: JSON.stringify({ importId, status: 'pending' }) };
}
import { parse } from 'csv-parse/sync';

// Split an array of rows into fixed-size batches
function chunk(rows, size) {
  const batches = [];
  for (let i = 0; i < rows.length; i += size) batches.push(rows.slice(i, i + size));
  return batches;
}

// Pipeline step: download, parse, and upsert the file outside the HTTP timeout window
export async function handler(event) {
  const { fileUrl, importId } = event.payload ?? {};
  const csvText = await fetch(fileUrl).then((r) => r.text());
  const rows = parse(csvText, { columns: true, skip_empty_lines: true });

  let inserted = 0;
  for (const batch of chunk(rows, 500)) {
    // Upsert by external_id — idempotent on retry
    await db.records.upsertBatch(batch.map((r) => ({ ...r, importId })));
    inserted += batch.length;
  }

  await db.imports.update(importId, { status: 'done', rowCount: rows.length });
  return { importId, rows: rows.length, inserted };
}
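The synchronous parse above still reads the whole response body into memory, which is exactly the buffering problem called out in the tradeoffs. A streaming variant using csv-parse's stream interface (assuming Node 18+ for fetch and Readable.fromWeb), with the same placeholder db client:

import { parse } from 'csv-parse';
import { Readable } from 'node:stream';

export async function handler(event) {
  const { fileUrl, importId } = event.payload ?? {};
  const res = await fetch(fileUrl);

  // Bridge the web stream into a Node stream and parse records as they arrive
  const parser = Readable.fromWeb(res.body).pipe(parse({ columns: true, skip_empty_lines: true }));

  let batch = [];
  let inserted = 0;
  for await (const row of parser) {
    batch.push({ ...row, importId });
    if (batch.length === 500) {
      await db.records.upsertBatch(batch); // same idempotent upsert as above
      inserted += batch.length;
      batch = [];
    }
  }
  if (batch.length > 0) {
    await db.records.upsertBatch(batch);
    inserted += batch.length;
  }

  await db.imports.update(importId, { status: 'done', rowCount: inserted });
  return { importId, inserted };
}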
When it fits
When this works
- Customer data imports, product catalog uploads, bulk user migrations
- Regular CSV exports from external systems that need nightly or triggered processing
When to skip it
- Tiny CSV files under 1000 rows that process in under 5 seconds—keep those synchronous for simpler debugging
FAQ
How do I handle CSV validation errors?
Collect validation errors per row and store them on the import job record. Return a summary (valid rows, invalid rows, error list) when the job completes. Let the client decide whether to proceed with partial import or fix errors first.
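A sketch of that collection step inside the pipeline handler, with illustrative rules; validateRow and the errors field on the job record are assumptions, not a fixed schema.

// Illustrative per-row validation; keep errors instead of aborting the import
function validateRow(row, lineNumber) {
  const errors = [];
  if (!row.email || !row.email.includes('@')) errors.push({ lineNumber, field: 'email', message: 'invalid email' });
  if (!row.external_id) errors.push({ lineNumber, field: 'external_id', message: 'missing external_id' });
  return errors;
}

// Inside the pipeline step, after parsing:
const validRows = [];
const rowErrors = [];
rows.forEach((row, i) => {
  const errors = validateRow(row, i + 2); // +2: rows are 1-based and the header occupies line 1
  if (errors.length > 0) rowErrors.push(...errors);
  else validRows.push(row);
});

// Store the summary so the client can decide: proceed with a partial import or fix and retry
await db.imports.update(importId, {
  validRows: validRows.length,
  invalidRows: rows.length - validRows.length,
  errors: rowErrors.slice(0, 1000), // cap stored errors for very large files
});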
How do I resume a failed import?
Track the last successfully processed row offset on the job record. Re-trigger the pipeline with that offset; the step skips already-processed rows. Use upsert-on-external-id as a safety net.
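A sketch of the offset checkpoint with the same placeholder db client; loadRows stands in for the parse step shown in the code example above.

// Pipeline step that resumes from the last committed offset on retry
export async function handler(event) {
  const { fileUrl, importId } = event.payload ?? {};
  const job = await db.imports.get(importId);
  const startAt = job?.lastOffset ?? 0;

  const rows = await loadRows(fileUrl); // parse as in the code example above
  for (let i = startAt; i < rows.length; i += 500) {
    const batch = rows.slice(i, i + 500);
    await db.records.upsertBatch(batch.map((r) => ({ ...r, importId })));

    // Checkpoint after every committed batch so a retry resumes here, not at row 0
    await db.imports.update(importId, { lastOffset: i + batch.length });
  }

  await db.imports.update(importId, { status: 'done', rowCount: rows.length });
  return { importId, resumedFrom: startAt, rows: rows.length };
}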