Serverless ML inference in Python: ONNX on CPU, load once

Serve a trained model behind an HTTP route without running a model server. The load-once-at-module-scope pattern, ONNX and scikit-learn attached as layers, memory and timeout budgets, warm pools, and the honest CPU-only limits.

Serverless ML inference without running a model server

You have a trained model. Maybe it is a scikit-learn classifier, a small gradient-boosted ranker, or an embedding head exported to ONNX. The inference code itself is short: load a file, call session.run(), return a JSON array. The awkward part is everything wrapped around it. The usual way to serve that model is to keep a container or a dedicated model server running 24/7, size it for peak traffic, and pay for the idle time between requests. For request-driven, bursty inference, that is a lot of standing infrastructure around a function that mostly waits.

Serverless ML inference flips that around. You deploy the handler, wire it to an authenticated HTTP route, and the platform runs it on demand, with no server to keep warm by hand and no autoscaling group to tune. On Inquir Compute this runs on container-backed Python 3.12, one container per function, with per-function dependency isolation. The model file lives inside the function bundle, your module loads it once per container, and warm containers reuse it across requests.

This post is the honest version of that story: CPU inference only (there is no GPU on this platform), the load-once-at-module-scope pattern that keeps warm requests fast, ONNX and scikit-learn attached as layers rather than baked into every deploy, and the memory and timeout budgets you have to fit inside. If your model is small-to-medium and your traffic is spiky, serverless model inference is a genuinely good fit. If you need GPU throughput or single-digit-millisecond tail latency at high QPS, it is not, and I will say so plainly in the limits section.

Why a glibc Python 3.12 runtime matters for serverless ONNX

The reason so much serverless machine learning advice starts with “it depends on your runtime” is that the runtime decides whether your wheels even load. Two things about Inquir’s Python runtime make serverless ONNX work cleanly.

First, it is a real CPython process, not a V8 isolate. Edge and isolate platforms do not run native ML wheels at all — onnxruntime ships a compiled binary, and an isolate simply cannot load it. Anything that depends on a real CPython interpreter with C extensions is off the table on those runtimes, no matter how small the model is. Container-backed Python does not have that problem: the handler runs in an ordinary Linux process.

Second, the base image is Python 3.12 on glibc (Debian slim), not musl/Alpine. This matters more than it sounds. onnxruntime, onnx, and scikit-learn all publish prebuilt manylinux wheels, and manylinux wheels are built against glibc. On a glibc base those wheels install and load exactly as published, the same way they do on a normal Linux box or in a CI runner. On a musl-based image they are not compatible, so pip can only use a prebuilt wheel if the project also publishes a separate musllinux build. Where it does not, you either fall back to slow source builds or hunt for a version that happens to ship a musl wheel. A glibc runtime is what lets you pip install onnxruntime and get the fast, tested, precompiled binary without a fight.
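If you want to confirm which libc a given environment is using before blaming a wheel, the standard library can tell you. This is a minimal sketch: describe_libc is a helper name I made up for illustration, and the wording of the labels is mine. It relies only on platform.libc_ver(), which reports glibc when it finds it and typically returns empty strings otherwise.

```python
import platform


def describe_libc(name: str, version: str) -> str:
    """Turn a (name, version) pair from platform.libc_ver() into a readable label."""
    if name == "glibc":
        return f"glibc {version}: manylinux wheels should install as published"
    # platform.libc_ver() typically returns empty strings when it cannot
    # identify glibc, which is what you tend to see on musl-based images.
    return "not glibc (possibly musl): expect missing wheels or source builds"


if __name__ == "__main__":
    # Run this inside the environment you are curious about.
    print(describe_libc(*platform.libc_ver()))
```

Running it locally, in CI, and once inside the deployed function is a quick way to check that all three environments agree.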

Put together: a real interpreter plus a glibc base is the boring foundation that makes serverless ML in Python predictable. As long as they run on glibc Linux, your local pip install, your CI, and the deployed function all resolve the same wheels, which is exactly the property you want when the thing you ship is a numerically sensitive model.

The load-once pattern: build the session at module scope

Here is the single most important idea for fast CPU inference on a serverless runtime: load the model once, at import time, and reuse it.

A serverless function is not a fresh process per request. On Inquir, hot/warm container pools are on by default — the platform keeps at least one warm container per function and can scale up to eight. The first request into a fresh container is a cold start: the interpreter boots, your module is imported, and any top-level code runs. Every subsequent request that lands on that same warm container skips all of that and calls straight into your handler function.

So where you build the InferenceSession decides your latency profile. If you build it inside the handler, you re-read the .onnx file and reconstruct the session on every single call — cold or warm — and you pay model-load latency on every request. If you build it at module scope (top level of the file), it is constructed exactly once per container, on cold start, and then every warm invocation reuses the already-loaded session. Model loading disappears from the hot path.
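The difference is easy to see with a counter. The sketch below swaps the real InferenceSession for a hypothetical load_model() stand-in (a short sleep plus a trivial function), so it runs anywhere without onnxruntime. All names in it are illustrative.

```python
import time

LOAD_COUNT = 0  # how many times the "model" has been constructed


def load_model():
    """Hypothetical stand-in for building an InferenceSession: slow to build, cheap to call."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.01)  # pretend this is the expensive model load
    return lambda x: [v * 2 for v in x]


def handler_per_request(features):
    model = load_model()  # anti-pattern: rebuilt on every call
    return model(features)


_MODEL = load_model()  # load-once: runs at import, i.e. on cold start


def handler_load_once(features):
    return _MODEL(features)  # warm path: no load, just inference


def loads_during(handler, calls=5):
    """Count how many model loads happen while serving `calls` requests."""
    before = LOAD_COUNT
    for _ in range(calls):
        handler([1.0, 2.0])
    return LOAD_COUNT - before
```

Five calls through handler_per_request trigger five loads. Five calls through handler_load_once trigger none, because the only load happened when the module was imported.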

The model file itself ships inside the function bundle. It deploys next to the handler in /var/task, so you can resolve its path relative to __file__ and load it directly from disk. No network call, no object store fetch, which fits nicely with the platform’s defaults: outbound network is disabled and the root filesystem is read-only (/tmp is a writable tmpfs). Reading a local model file needs neither network access nor a writable disk.

Here is a realistic handler that follows the pattern end to end:

import os
import json
import time
import numpy as np
import onnxruntime as ort

# ---- Loaded ONCE on cold start, at module scope ----
# The .onnx model ships inside the function bundle and lives next to
# this handler in /var/task. Building the session here means it loads
# on the cold start of a fresh container, then every warm invocation
# in the pool reuses it — no per-request model load.
MODEL_PATH = os.path.join(os.path.dirname(__file__), "model.onnx")

_options = ort.SessionOptions()
# CPU inference: cap intra-op threads so one heavy request does not
# contend for every core in the container.
_options.intra_op_num_threads = int(os.environ.get("ORT_THREADS", "1"))

# CPU only — there is no GPU on the platform.
_session = ort.InferenceSession(
    MODEL_PATH,
    sess_options=_options,
    providers=["CPUExecutionProvider"],
)
_input_name = _session.get_inputs()[0].name
_loaded_at = time.time()  # constant across warm calls => proof of load-once


def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    features = body.get("features")
    if features is None:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "features required"}),
        }

    # Accept a single row or a batch; onnxruntime wants a 2-D float32 array.
    x = np.asarray(features, dtype=np.float32)
    if x.ndim == 1:
        x = x.reshape(1, -1)  # single row -> batch of 1

    started = time.perf_counter()
    outputs = _session.run(None, {_input_name: x})  # warm: reuses _session
    elapsed_ms = round((time.perf_counter() - started) * 1000, 2)

    return {
        "statusCode": 200,
        "body": json.dumps(
            {
                "predictions": outputs[0].tolist(),
                "rows": int(x.shape[0]),
                "inference_ms": elapsed_ms,
                "session_loaded_at": _loaded_at,
            }
        ),
    }

A useful trick is in that last field. session_loaded_at is captured once, when the session is built. If you hit the route repeatedly and the timestamp stays the same, you are landing on a warm container that never reloaded the model. If it jumps, you hit a cold start. It is a cheap, honest way to watch the load-once pattern actually working in production.
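With up to eight containers in the pool, the timestamp can alternate between a handful of values rather than jump once, so it helps to track every value you have seen. Here is a small client-side sketch. classify_starts is a hypothetical helper, and it assumes you are the only caller: a container that is new to you may already have been warmed by someone else.

```python
def classify_starts(loaded_at_values):
    """Label each response 'cold' or 'warm' from its session_loaded_at value.

    A value not seen before means a container this client has not hit yet,
    so that request is counted as a cold start. Any repeat is warm.
    """
    seen = set()
    labels = []
    for ts in loaded_at_values:
        labels.append("warm" if ts in seen else "cold")
        seen.add(ts)
    return labels


# Two containers: one loaded at t=100.0, a second at t=250.5.
print(classify_starts([100.0, 100.0, 250.5, 100.0, 250.5]))
# ['cold', 'warm', 'cold', 'warm', 'warm']
```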

Attaching onnxruntime and scikit-learn as layers

The handler above imports onnxruntime and numpy. You do not want to bundle those wheels into every deploy of your function — they are large, and re-uploading them on every code change is wasteful. The right mechanism is a layer.

Layers on Inquir are shared dependency bundles for Node, Python, and Go, mounted into the container at invoke time. For a Python inference function you declare your ML dependencies in the layer’s requirements.txt and attach that layer to the function. Because the base is glibc, the published manylinux wheels install into the layer exactly as they would locally:

# layer requirements.txt — attached to the function, not bundled per deploy
onnxruntime
numpy

Attach onnxruntime (plus numpy, and onnx/scikit-learn if you need them) as a layer, and your function bundle stays tiny: just the handler and the .onnx model file. Deploys are fast because you are only shipping code, while the heavy compiled dependencies live in a layer you attach. Treat the layer as the place ML libraries live, and the bundle as the place your handler and model live.

Two honest notes. First, I am describing the mechanism: you declare these libraries in the layer’s requirements.txt and attach the layer to the function. Do not assume any specific package is preinstalled in the base image. Pin versions in the layer and you control exactly what loads. Second, on how the model reaches ONNX: for a scikit-learn estimator, export it with skl2onnx, which converts a fitted estimator or pipeline into an .onnx graph you then serve with onnxruntime. If you would rather keep the estimator native, attach scikit-learn as a layer and call predict() directly. Exporting to ONNX, though, gives you a portable, self-contained artifact and often faster CPU inference, which is why it is the path I lead with here.
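For the export step, a minimal sketch looks like this. It is not taken from the platform docs: the iris dataset, the LogisticRegression model, and the export_to_onnx name are all my own illustration, and it assumes scikit-learn and skl2onnx are installed on the machine where you train. The to_onnx call and the zipmap option come from skl2onnx.

```python
def export_to_onnx(path="model.onnx"):
    """Fit a toy classifier and write it to `path` as an ONNX graph."""
    # Imported inside the function because the training stack only needs
    # to exist on the build machine, not in the deployed function.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from skl2onnx import to_onnx

    X, y = load_iris(return_X_y=True)
    X = X.astype(np.float32)  # match the float32 input the handler sends
    model = LogisticRegression(max_iter=500).fit(X, y)

    # A sample row tells skl2onnx the input dtype and feature count.
    # zipmap=False asks for class probabilities as a plain array
    # instead of a list of dicts.
    onx = to_onnx(model, X[:1], options={id(model): {"zipmap": False}})
    with open(path, "wb") as f:
        f.write(onx.SerializeToString())
    return path
```

For a classifier exported this way, the first output is the predicted label and the second is the probability array, so a handler that returns the first output is returning labels.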

Budgeting memory and timeout for CPU inference

Serverless functions run inside real limits, and for model inference the two that bite are memory and timeout. Know the numbers before you deploy.

Memory defaults to 256MB and is configurable from 64MB up to 2GB (2048MB) per function. The model plus the runtime plus numpy buffers all have to fit. A small scikit-learn classifier or a modest gradient-boosted tree exported to ONNX fits comfortably in the default. A larger model, or batch inference that allocates big intermediate arrays, is exactly when you raise the memory ceiling toward 2GB. If a model needs more than 2GB resident, it does not fit this platform — that is a hard edge, not a tuning knob.
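A back-of-envelope check makes the memory budget concrete. This sketch is arithmetic only. The runtime_overhead_mb and copies defaults are guesses of mine rather than measured platform figures, so treat the output as a starting point and measure your own function.

```python
MB = 1024 * 1024


def estimate_resident_mb(model_file_mb, rows, features,
                         runtime_overhead_mb=120.0, copies=2):
    """Rough resident-memory estimate for scoring one float32 batch.

    runtime_overhead_mb (interpreter plus libraries) and copies (input
    plus intermediate arrays) are illustrative assumptions.
    """
    batch_mb = rows * features * 4 / MB  # float32 is 4 bytes per value
    return runtime_overhead_mb + model_file_mb + batch_mb * copies


def fits(estimate_mb, limit_mb=256, headroom=0.8):
    """True if the estimate stays under a safety fraction of the memory limit."""
    return estimate_mb <= limit_mb * headroom


single = estimate_resident_mb(model_file_mb=20, rows=1, features=100)
bulk = estimate_resident_mb(model_file_mb=20, rows=1_000_000, features=100)
print(round(single), fits(single))             # one row fits the 256MB default
print(round(bulk), fits(bulk), fits(bulk, limit_mb=2048))
```

Under these assumptions, a million rows of 100 features is about 381MB per copy of the batch, which overflows the default but fits once the ceiling is raised toward 2GB.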

Timeout defaults to 5 seconds and can be raised to a maximum of 15 minutes (900,000ms). For a single prediction on a small-to-medium CPU model, 5 seconds is plenty — you return synchronously inside the request. The interesting case is bulk scoring. If you are scoring thousands of rows and the work legitimately runs long, do not try to cram it into one synchronous HTTP call. Instead, accept the request, return 202 Accepted, and run the inference as a background pipeline step, which gets the full 15-minute budget per step. That keeps the request-facing route snappy while the heavy scoring runs behind it.
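The split between synchronous and background scoring can be sketched as a small routing function. Everything here is illustrative: SYNC_ROW_LIMIT is a made-up cutoff, and enqueue is a placeholder for however your deployment starts a background pipeline step, since I am not showing a real platform API for that.

```python
import json

SYNC_ROW_LIMIT = 500  # hypothetical cutoff; tune it against your timeout


def route_scoring(rows, score_now, enqueue):
    """Score small batches inline; hand large ones to a background step.

    score_now(rows) returns predictions synchronously.
    enqueue(rows) is a stand-in that starts background work and returns a job id.
    """
    if len(rows) <= SYNC_ROW_LIMIT:
        body = {"predictions": score_now(rows)}
        return {"statusCode": 200, "body": json.dumps(body)}
    job_id = enqueue(rows)
    body = {"job_id": job_id, "rows": len(rows)}
    return {"statusCode": 202, "body": json.dumps(body)}
```

Injecting score_now and enqueue as arguments keeps the routing decision testable without a model or a queue behind it.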

Two more limits matter for inference specifically. The request body defaults to 2MB, so a very large feature payload has to be chunked. And the stored result JSON is capped at 64KB of characters — fine for a prediction or small batch, but for a huge batch, page the results or write them to your own store rather than returning one giant JSON blob. These are quiet limits that only hurt if you learn them the hard way, so budget for them up front.
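Chunking a large feature payload on the client side can look like the sketch below. chunk_rows is a hypothetical helper. It assumes the request body has the same {"features": [...]} shape the handler above expects, and it deliberately overestimates each row by two bytes so a chunk never exceeds the limit.

```python
import json

BODY_LIMIT_BYTES = 2 * 1024 * 1024  # the 2MB default request body limit


def chunk_rows(rows, limit=BODY_LIMIT_BYTES):
    """Yield lists of rows whose JSON body stays within `limit` bytes."""
    envelope = len(json.dumps({"features": []}).encode("utf-8"))
    chunk, size = [], envelope
    for row in rows:
        # +2 covers the ", " separator json.dumps puts between rows.
        row_bytes = len(json.dumps(row).encode("utf-8")) + 2
        if envelope + row_bytes > limit:
            raise ValueError("a single row exceeds the body limit")
        if chunk and size + row_bytes > limit:
            yield chunk
            chunk, size = [], envelope
        chunk.append(row)
        size += row_bytes
    if chunk:
        yield chunk
```

Send each chunk as its own request and concatenate the predictions in order. The same idea applies in reverse to results: return a page at a time rather than one oversized JSON blob.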

Warm pools and cold starts in serverless machine learning

The load-once pattern only pays off because warm containers stick around. It is worth understanding exactly how the pool behaves so your latency expectations are calibrated.

By default the platform keeps a minimum of 1 warm container per function and scales up to a maximum of 8 under concurrency. Containers above the minimum are evicted after roughly 5 minutes idle. Even the minimum container is removed after about 10 minutes with no traffic at all, and every container is recycled after 1,000 invocations. In practice: under steady traffic your model stays loaded and the vast majority of requests are warm, hitting only the session.run() path. After a quiet period, or when traffic spikes past the pool size and new containers spin up, you pay a cold start (interpreter boot plus your module-scope model load) on those specific requests.
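To calibrate expectations for your own traffic, a toy model of those rules is enough. The sketch below assumes sequential requests served by a single container at a time and ignores concurrency scale-up entirely. It also assumes the request after a recycle pays a cold start, which I have not verified. count_cold_starts is a name I made up.

```python
IDLE_EVICT_S = 10 * 60   # minimum container removed after ~10 minutes fully idle
RECYCLE_AFTER = 1000     # containers recycle after 1,000 invocations


def count_cold_starts(arrival_times_s):
    """Count cold starts for sequential requests hitting one container at a time."""
    cold = 0
    last_seen = None
    served = 0  # invocations handled by the current container
    for t in arrival_times_s:
        idle_too_long = last_seen is not None and t - last_seen > IDLE_EVICT_S
        if last_seen is None or idle_too_long or served >= RECYCLE_AFTER:
            cold += 1   # fresh container: interpreter boot plus model load
            served = 0
        served += 1
        last_seen = t
    return cold


print(count_cold_starts([0, 60, 120]))    # 1: steady traffic stays warm
print(count_cold_starts([0, 700, 1500]))  # 3: every gap exceeds 10 minutes
```

Feed it the arrival times from a real day of traffic and you get a rough count of how many requests would have paid the model load.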

This is the honest framing the platform itself uses: cold starts are not zero. Warm pools shrink them dramatically and keep them off the steady-state path, but the first invocation into a fresh container, or the first after an idle window, pays the load. For CPU inference the model-load portion of a cold start scales with model size, because a bigger .onnx file takes longer to construct into a session. That is one more reason to keep models lean.

Because every invocation produces a run record with duration, status, and logs, you can watch this directly. Compare cold-start latency against warm latency in the execution history, and the session_loaded_at field tells you which bucket each request fell into. Frequent cold starts mean your traffic is bursty relative to the idle-eviction windows — a case for keeping the function warm with light periodic traffic, or accepting the occasional cold hit as the price of paying nothing while idle.
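Once you have durations and a cold or warm label for each run, the comparison is a few lines. This sketch assumes you have already pulled the records into (is_cold, duration_ms) tuples. That shape is my own convention, not a format the platform exports.

```python
from statistics import median


def summarize(records):
    """Split (is_cold, duration_ms) records into buckets; report count, median, max."""
    buckets = {"cold": [], "warm": []}
    for is_cold, duration_ms in records:
        buckets["cold" if is_cold else "warm"].append(duration_ms)
    return {
        name: {"count": len(v), "median_ms": median(v), "max_ms": max(v)}
        for name, v in buckets.items()
        if v  # skip empty buckets so median() and max() never see an empty list
    }
```

A large gap between the cold and warm medians is mostly model load, which is the part you can shrink by keeping the model lean.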

Honest limits: CPU only, model size, and latency

Every honest serverless machine learning page needs a section that says what this is not for. Here it is, plainly.

There is no GPU. Inference runs on CPU via CPUExecutionProvider, full stop. Nothing about this platform accelerates matrix math on a GPU, so do not size a workload assuming one. That rules out large transformer models, high-resolution vision networks, and anything whose latency budget only closes with GPU throughput. It also rules out training — this is an inference-serving story, not a place to fit models.

Model size and memory are a hard boundary. The model, runtime, and working buffers must fit inside the per-function memory ceiling, which tops out at 2GB. Small-to-medium models — scikit-learn estimators, gradient-boosted trees, compact embedding heads, quantized ONNX graphs — are the sweet spot. If your artifact is multiple gigabytes, this is the wrong tool.

Latency is CPU latency, and cold starts are real. For request-driven, bursty inference, warm CPU latency on a small model is perfectly reasonable and the load-once pattern keeps it consistent. But if you need single-digit-millisecond p99 at high, sustained QPS, a dedicated always-on server sized for that load will beat an on-demand function that occasionally cold-starts. Do not promise yourself GPU numbers or zero cold starts; neither exists here.

No durable orchestration. This is not a durable workflow engine. Long or multi-stage scoring belongs in background pipeline steps (each capped at 15 minutes), so treat inference as a stateless call, keep handlers idempotent, and chain steps when the work outgrows a single request.

Where it does fit is a large and common space: a trained model you want behind an authenticated HTTP route, driven by spiky or low-volume traffic, where paying for an always-on GPU or an idle container is pure waste. For that shape of workload, serverless model inference on CPU is not a compromise. It is the right amount of infrastructure.

Takeaway

Serverless ML inference on Inquir is deliberately unglamorous, and that is the point. You run a trained model on container-backed Python 3.12 on glibc, so onnxruntime and scikit-learn install from their manylinux wheels the way they do everywhere else. You attach those libraries as a layer instead of bundling them per deploy, ship the .onnx model inside the function next to the handler, and build the InferenceSession once at module scope so cold starts pay the load and warm containers reuse it. You budget inside 256MB (up to 2GB) of memory and a 5-second-to-15-minute timeout, push bulk scoring into a pipeline step, and you watch cold-vs-warm latency in the run history instead of guessing.

And you stay honest about the edges: CPU inference only, no GPU, model size capped by memory, cold starts real but bounded by warm pools. For a small-to-medium ONNX model serving bursty traffic behind an authenticated route, that trade is excellent: on-demand serverless ML in Python with no server to babysit, and nothing to pay while nothing is calling. Export a model to ONNX, attach onnxruntime as a layer, load it once, and put it behind a route.