Inquir Compute

Serverless CSV processing: parse, transform, and load large files

Accept a CSV upload via HTTP, return 202 immediately, and process the file in a background pipeline. Parse millions of rows, validate and transform data, upsert in batches with idempotency keys, and notify users when the import completes—all outside the HTTP timeout window.

Last updated: 2026-04-20

Direct answer

The HTTP handler validates the file reference (URL or storage key), stores job metadata, and returns 202 with a job ID. The pipeline step reads the CSV in chunks, upserts idempotently by row ID, and records progress for resumability.

When it fits

  • Customer data imports, product catalog uploads, bulk user migrations
  • Regular CSV exports from external systems that need nightly or triggered processing

Tradeoffs

  • Streaming in the HTTP handler still ties the request to the import: a database slow-down mid-import times out at the gateway, the client retries, and you get duplicate rows unless upserts are idempotent
  • Memory is easy to get wrong: a "streaming" parse that buffers validated rows for a single bulk insert still holds the entire file in RAM before writing

Why CSV processing breaks synchronous HTTP handlers

  • Files with 100k+ rows take 30–300 seconds to parse and insert—well past gateway timeouts
  • Inline processing holds the HTTP connection open—clients time out or retry, causing double imports
  • Memory pressure: loading a 50MB CSV into a serverless function in a single pass causes OOM on small runtimes

CSV import is one of the most common patterns that does not fit synchronous HTTP handlers. The file size is unpredictable, the parse and validation time grows linearly with rows, and any failure mid-import without idempotency creates partial data that is hard to recover from.

Why streaming CSV in the HTTP handler is fragile

Even with streaming parsers, the HTTP handler must stay open while rows are being upserted. A database slow-down mid-import causes a timeout at the gateway, the client retries, and you get duplicate rows unless you built idempotency from the start.

Memory management is harder than it looks: a "streaming" CSV parse that collects validated rows in memory for a bulk insert still loads the entire file into RAM before writing.
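One way to keep memory bounded is to flush fixed-size batches as rows stream in, rather than collecting them all before writing. A minimal sketch over any async row iterator, where `writeBatch` stands in for your upsert call:

```javascript
// Consume rows from any async iterable (e.g. a streaming CSV parser)
// and flush fixed-size batches, so memory is bounded by one batch.
async function batchedImport(rowStream, writeBatch, batchSize = 500) {
  let batch = [];
  let total = 0;
  for await (const row of rowStream) {
    batch.push(row);
    if (batch.length === batchSize) {
      await writeBatch(batch);
      total += batch.length;
      batch = [];
    }
  }
  if (batch.length) {
    await writeBatch(batch);
    total += batch.length;
  }
  return total;
}
```

Because the parser is consumed row by row, peak memory stays at one batch regardless of file size.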

HTTP accepts, pipeline processes in chunks

The HTTP handler validates the file reference (URL or storage key), stores job metadata, and returns 202 with a job ID. The pipeline step reads the CSV in chunks, upserts idempotently by row ID, and records progress for resumability.

Long CSVs can be processed in multiple pipeline steps—split by row range, fan out in parallel, fan in to a summary step. Each step has its own timeout budget; failure in one step retries that step without restarting from row 1.
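The row-range split can be sketched as a pure helper; how each range is handed to a pipeline step depends on your trigger API, so the fan-out call below is only indicated in a comment with a hypothetical `startStep` name:

```javascript
// Split a row count into [start, end) ranges, one per pipeline step.
function rowRanges(totalRows, chunkSize) {
  const ranges = [];
  for (let start = 0; start < totalRows; start += chunkSize) {
    ranges.push({ start, end: Math.min(start + chunkSize, totalRows) });
  }
  return ranges;
}

// Hypothetical fan-out: one step per range, fan in to a summary step after.
// for (const range of rowRanges(250000, 50000)) {
//   await startStep('process-csv-range', { fileUrl, importId, ...range });
// }
```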

Serverless CSV processing patterns

Chunked processing with resumability

Split large CSVs into row-range pipeline steps. A failure at row 80k resumes from that checkpoint, not from row 1.

Parallel batch upsert

Fan out multiple pipeline steps to process row ranges in parallel—reduce total import time for large files.

Idempotent row upsert

Use a stable row identifier (external ID or row hash) as the upsert key. Re-running the import on the same file produces the same database state.
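When the CSV has no external ID column, a content hash of the row can serve as the stable key. A sketch using Node's crypto module; the canonicalization choice (sorting keys before hashing) is an assumption for illustration:

```javascript
import { createHash } from 'node:crypto';

// Derive a stable upsert key from row content when there is no external ID.
// Keys are sorted so column order changes don't alter the hash.
function rowKey(row) {
  const canonical = JSON.stringify(Object.fromEntries(Object.entries(row).sort()));
  return createHash('sha256').update(canonical).digest('hex');
}
```

Two rows with identical content always map to the same key, so re-running the import upserts rather than duplicates.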

Progress tracking and notification

HTTP handler returns a job ID. Client polls a status endpoint; final pipeline step notifies via webhook or email when complete.
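Client-side, the polling loop might look like the sketch below; the `/api/imports/:id` endpoint path is hypothetical:

```javascript
// Poll a (hypothetical) status endpoint until the import finishes or fails.
async function waitForImport(importId, { intervalMs = 2000, maxAttempts = 150 } = {}) {
  for (let i = 0; i < maxAttempts; i++) {
    const res = await fetch(`/api/imports/${importId}`);
    const job = await res.json();
    if (job.status === 'done' || job.status === 'failed') return job;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`import ${importId} still pending after ${maxAttempts} polls`);
}
```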

Serverless CSV import flow

1. HTTP handler accepts file reference, returns 202

Validate the file URL or storage key. Store a job record with status=pending. Trigger the pipeline with the file reference and job ID.

2. Pipeline step reads and processes CSV

Download and parse the CSV in the pipeline step. Upsert rows in batches of 500–1000 with idempotency keys. Update progress on the job record.

3. Final step notifies

After the last batch, mark the job complete and notify the user via email, webhook, or status update.

Chunked CSV import pipeline

HTTP handler returns 202; pipeline step processes the file in batches. Idempotency key prevents duplicate rows on retry.

api/import-csv.mjs (HTTP handler)
// db (database client) and global.durable (pipeline trigger) are assumed
// to be provided by the application and platform runtime.
export async function handler(event) {
  const { fileUrl, importId } = JSON.parse(event.body || '{}');
  if (!fileUrl || !importId) {
    return { statusCode: 400, body: JSON.stringify({ error: 'fileUrl and importId required' }) };
  }
  await db.imports.create({ id: importId, status: 'pending', fileUrl });
  await global.durable.startNew('process-csv', undefined, { fileUrl, importId });
  return { statusCode: 202, body: JSON.stringify({ importId, status: 'pending' }) };
}
jobs/process-csv.mjs (pipeline step)
import { parse } from 'csv-parse/sync';

// Split an array into fixed-size batches for upserting.
const chunk = (arr, size) =>
  Array.from({ length: Math.ceil(arr.length / size) }, (_, i) =>
    arr.slice(i * size, (i + 1) * size));

export async function handler(event) {
  const { fileUrl, importId } = event.payload ?? {};
  // Note: this buffers the whole file in memory; for files near the
  // runtime's memory limit, switch to csv-parse's streaming API.
  const csvText = await fetch(fileUrl).then((r) => r.text());
  const rows = parse(csvText, { columns: true, skip_empty_lines: true });
  let inserted = 0;
  for (const batch of chunk(rows, 500)) {
    // Upsert keyed on external_id — idempotent on retry
    await db.records.upsertBatch(batch.map((r) => ({ ...r, importId })));
    inserted += batch.length;
  }
  await db.imports.update(importId, { status: 'done', rowCount: rows.length });
  return { importId, rows: rows.length, inserted };
}

Use serverless CSV processing for

When this works

  • Customer data imports, product catalog uploads, bulk user migrations
  • Regular CSV exports from external systems that need nightly or triggered processing

When to skip it

  • Tiny CSV files under 1000 rows that process in under 5 seconds—keep those synchronous for simpler debugging

FAQ

How do I handle CSV validation errors?

Collect validation errors per row and store them on the import job record. Return a summary (valid rows, invalid rows, error list) when the job completes. Let the client decide whether to proceed with partial import or fix errors first.
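A sketch of per-row validation that collects errors instead of aborting; the required-column list is an illustrative assumption:

```javascript
// Validate rows and collect per-row errors rather than failing the import.
// Returns valid rows plus a 1-indexed error list for the job record.
function validateRows(rows, required = ['email', 'name']) {
  const valid = [];
  const errors = [];
  rows.forEach((row, i) => {
    const missing = required.filter((col) => !row[col] || String(row[col]).trim() === '');
    if (missing.length) errors.push({ row: i + 1, missing });
    else valid.push(row);
  });
  return { valid, errors };
}
```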

How do I resume a failed import?

Track the last successfully processed row offset on the job record. Re-trigger the pipeline with that offset so the step skips already-processed rows. Use upsert on a stable external ID for safety.
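That resume step can be sketched as a helper that skips already-committed rows before re-batching (names hypothetical):

```javascript
// Skip rows before the checkpoint offset, then split the rest into batches.
// offset comes from the job record; each returned batch is one upsert call.
function pendingBatches(rows, offset, batchSize = 500) {
  const remaining = rows.slice(offset);
  const batches = [];
  for (let i = 0; i < remaining.length; i += batchSize) {
    batches.push(remaining.slice(i, i + batchSize));
  }
  return batches;
}
```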

Inquir Compute

The simplest way to run AI agents and backend jobs without infrastructure.

Contact info@inquir.org

© 2025 Inquir Compute. All rights reserved.