Inquir Compute · ML inference

Serverless ONNX model inference on Python 3.12

Run a trained model behind an HTTP route without standing up a model server. The runtime is Python 3.12 on glibc (Debian slim), so manylinux wheels install cleanly, and onnxruntime, onnx, and scikit-learn are available as attachable layers. Ship the model file inside the function, where it lives next to the handler in /var/task, then load it once on cold start and reuse it across warm invocations. This is CPU inference: there is no GPU.

Last updated: 2026-06-25

  • Python 3.12 on glibc (Debian slim): manylinux wheels work, no edge-isolate limits
  • Attachable layers: py-onnxruntime, py-onnx, py-scikit-learn
  • Model ships in /var/task next to the handler—loaded once on cold start, reused warm
  • 256MB default memory (up to 2GB); 5s default timeout, 15min max

Direct answer

Inquir runs Python 3.12 in containers on glibc (Debian slim), so manylinux wheels install the way they do on a normal Linux box. onnxruntime, onnx, and scikit-learn are available as attachable layers (py-onnxruntime, py-onnx, py-scikit-learn): attach the layer to the function instead of bundling the wheel into every deploy.

When it fits

  • Request-driven inference for a small-to-medium model exported to ONNX (classifier, ranker, embedding head)
  • scikit-learn or other CPU models you want behind an authenticated HTTP route without running a model server
  • Bursty or low-volume scoring where paying for an always-on GPU or container is wasteful

Tradeoffs

  • Edge and V8-isolate platforms do not run native ML wheels at all—onnxruntime ships a compiled binary, and isolates cannot load it. So inference that depends on a real CPython process is off the table on those runtimes regardless of how small the model is.
  • Even on platforms that run Python, a musl-based (Alpine) image breaks the prebuilt manylinux wheels that onnxruntime and scikit-learn publish, forcing slow source builds or pinned older versions. A glibc base is what lets those wheels install as published.

What it costs to serve one trained model

You have a trained model—a scikit-learn classifier exported to ONNX, a small gradient-boosted ranker, an embedding head. Serving it usually means running a container or a model server 24/7, sizing it for peak traffic, and paying for idle time between requests. For request-driven inference that is mostly bursty, that is a lot of standing infrastructure for a function that loads a file and calls session.run().

The model itself is small and the inference code is short. What is missing is a place to run it on demand: install the runtime, keep the model in memory between calls, expose it behind an authenticated HTTP route, and not pay when nothing is calling it.

Why edge and minimal runtimes block model inference

Edge and V8-isolate platforms do not run native ML wheels at all—onnxruntime ships a compiled binary, and isolates cannot load it. So inference that depends on a real CPython process is off the table on those runtimes regardless of how small the model is.

Even on platforms that run Python, a musl-based (Alpine) image breaks the prebuilt manylinux wheels that onnxruntime and scikit-learn publish, forcing slow source builds or pinned older versions. A glibc base is what lets those wheels install as published.
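To see which side of the glibc/musl split a given Python process is on, the standard library's platform.libc_ver() can be queried. The following is a minimal sketch: the function name describe_libc is illustrative, and the reading that an empty name indicates a non-glibc system (such as musl on Alpine) is an assumption about typical behavior, not something guaranteed by the platform.

```python
import platform


def describe_libc() -> str:
    """Report the C library this interpreter appears to be linked against."""
    name, version = platform.libc_ver()
    if name == "glibc":
        return f"glibc {version}: manylinux wheels should install as published"
    # Assumption: an empty or different name means a non-glibc system,
    # for example musl on Alpine, where manylinux wheels do not apply.
    return f"not glibc (reported: {name or 'unknown'}): manylinux wheels will not apply"


if __name__ == "__main__":
    print(describe_libc())
```

Running this inside a container image before deploying is a quick way to check whether prebuilt wheels are likely to load there.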

A glibc Python runtime with attachable ML layers

Inquir runs Python 3.12 in containers on glibc (Debian slim), so manylinux wheels install the way they do on a normal Linux box. onnxruntime, onnx, and scikit-learn are available as attachable layers (py-onnxruntime, py-onnx, py-scikit-learn): attach the layer to the function instead of bundling the wheel into every deploy.

The model file ships inside the function bundle and lives next to the handler in /var/task, or you can mount it as a layer. Load it once at module scope—on cold start—and the warm container pool (min 1, up to 8 per function) reuses the loaded session across invocations, so steady traffic does not reload the model each call. Inference runs on CPU: there is no GPU on the platform, so this fits small-to-medium models, not large GPU-bound networks.

What serverless ONNX inference covers

Attach the runtime as a layer

Add py-onnxruntime (and py-onnx or py-scikit-learn as needed) to the function instead of bundling the wheel into every deploy. The glibc base means the published manylinux wheel loads as-is.

Model loaded once on cold start

Build the InferenceSession at module scope. The first invocation in a fresh container pays the load; warm containers in the pool reuse the session, so most requests skip model loading entirely.
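The effect of module scope can be illustrated without onnxruntime at all. In this sketch, load_model is a hypothetical stand-in for building an InferenceSession and the "model" is a toy function; the point is only that code at module scope runs once per process, while the handler runs once per invocation.

```python
import time

LOAD_COUNT = 0  # how many times the expensive load ran in this process


def load_model():
    """Hypothetical stand-in for building an onnxruntime InferenceSession."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.05)  # simulate a slow load
    return lambda rows: [sum(row) for row in rows]  # toy "model": row sums


# Module scope: executed once, when the process first imports this module.
_model = load_model()


def handler(event, context=None):
    # Every invocation in this process reuses the already-loaded _model.
    return {"predictions": _model(event["features"])}


if __name__ == "__main__":
    for _ in range(3):
        handler({"features": [[1.0, 2.0], [3.0, 4.0]]})
    print("loads:", LOAD_COUNT)
```

However many times the handler is called within one process, the load counter stays at 1; a new container would import the module again and pay the load once more.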

Inference behind an authenticated route

Wire the handler to a gateway route with api-key or bearer auth. Callers POST input features as JSON; the handler returns predictions. Memory defaults to 256MB and can be raised to 2GB for larger models or batches.
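From the caller's side, a request to such a route might look like the sketch below, using only the standard library. The URL, the key value, and the use of an Authorization: Bearer header are all assumptions for illustration; an api-key route may expect a different header, and the sketch assumes the gateway returns the handler's JSON body directly.

```python
import json
import urllib.request

# Hypothetical values: replace with your own route URL and credential.
ROUTE_URL = "https://example.invalid/predict"
API_KEY = "replace-me"


def build_request(features, url=ROUTE_URL, api_key=API_KEY):
    """Build (but do not send) a POST request carrying features as JSON."""
    payload = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed bearer scheme; adjust to whatever the route requires.
            "Authorization": f"Bearer {api_key}",
        },
    )


def predict(features, timeout=5.0):
    """Send the request and return the decoded predictions list."""
    with urllib.request.urlopen(build_request(features), timeout=timeout) as resp:
        return json.loads(resp.read())["predictions"]
```

Separating build_request from predict keeps the payload shape checkable without any network access.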

Batch or async inference

For a single prediction, return synchronously inside the 5s default timeout. For bulk scoring that runs longer, accept the request, return 202, and run inference in a pipeline step (up to 15 minutes per step).
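The split between synchronous and deferred scoring can be sketched as a small routing function. Everything platform-specific here is hypothetical: MAX_SYNC_ROWS is an arbitrary threshold, and enqueue stands in for whatever call starts a pipeline step, whose real name and signature this page does not specify.

```python
import json

MAX_SYNC_ROWS = 64  # hypothetical cutoff; tune it against the 5s default timeout


def route_request(rows, predict, enqueue):
    """Score small batches inline; hand large ones to a background step.

    predict and enqueue are caller-supplied callables. enqueue is an
    assumed placeholder for starting a pipeline step and returning a job id.
    """
    if len(rows) <= MAX_SYNC_ROWS:
        body = {"predictions": predict(rows)}
        return {"statusCode": 200, "body": json.dumps(body)}
    job_id = enqueue(rows)
    body = {"job_id": job_id, "status": "accepted"}
    return {"statusCode": 202, "body": json.dumps(body)}
```

Passing predict and enqueue in as arguments keeps the decision logic testable on its own, independent of the model and of the pipeline mechanism.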

How to serve an ONNX model as a function

Export the model to ONNX, ship it with the handler, attach the runtime layer, and expose it behind a route.

1. Export and ship the model

Export your trained model to ONNX (e.g. with skl2onnx for a scikit-learn estimator). Place the .onnx file in the function bundle so it deploys to /var/task next to the handler.

2. Attach the runtime and load once

Attach the py-onnxruntime layer. Build the InferenceSession at module scope, pointing at the model path in /var/task, so it loads on cold start and stays resident in warm containers.

3. Expose and observe

Connect the handler to a gateway route with api-key auth. Each invocation produces a run record—duration, status, logs—so you can watch cold-start vs warm latency in execution history.
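To make the cold-versus-warm distinction visible in logs, a handler can be wrapped so each invocation prints its own duration and whether it was the first call in the process. This is a sketch under assumptions: the decorator name timed and the log field names are invented here, and whether stdout ends up in the run record's logs is assumed rather than confirmed. It also measures only handler time, not the module-scope model load that happens at import.

```python
import json
import time

_COLD = True  # module scope: True only until the first invocation finishes


def timed(handler_fn):
    """Wrap a handler so each call logs its duration and cold/warm state."""
    def wrapper(event, context=None):
        global _COLD
        start = time.perf_counter()
        try:
            return handler_fn(event, context)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # One JSON line per invocation, written to stdout.
            print(json.dumps({"cold_start": _COLD, "duration_ms": round(elapsed_ms, 2)}))
            _COLD = False
    return wrapper


@timed
def handler(event, context=None):
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```

Filtering log lines on the cold_start field then gives a simple way to compare first-call latency against steady-state latency.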

ONNX inference handler: load once, run per request

The model is loaded at import time (cold start) and reused by every warm invocation. The handler reads features from the request body and returns predictions. The model file lives next to this handler in /var/task.

requirements.txt (or attach the py-onnxruntime layer)
# Python 3.12 on glibc (Debian slim) — manylinux wheels install as published.
# Prefer attaching the shared layers (py-onnxruntime, py-numpy) over bundling.
onnxruntime
numpy

infer.py (Python 3.12 handler)
import os
import json
import numpy as np
import onnxruntime as ort

# Loaded ONCE on cold start, at module scope.
# The model ships inside the function and lives next to this handler in /var/task.
MODEL_PATH = os.path.join(os.path.dirname(__file__), "model.onnx")
_session = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])  # CPU only — no GPU
_input_name = _session.get_inputs()[0].name


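def handler(event, context):
    # Reject malformed JSON with a 400 instead of failing with an unhandled error.
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON body"})}

    features = body.get("features") if isinstance(body, dict) else None
    if not features:
        return {"statusCode": 400, "body": json.dumps({"error": "features required"})}

    # Ragged or non-numeric input cannot be converted to a float32 array.
    try:
        x = np.asarray(features, dtype=np.float32)
    except (TypeError, ValueError):
        return {"statusCode": 400, "body": json.dumps({"error": "features must be numeric"})}
    if x.ndim == 1:
        x = x.reshape(1, -1)  # single row -> batch of 1

    # Warm containers reuse _session, so there is no per-request model load.
    outputs = _session.run(None, {_input_name: x})
    predictions = outputs[0].tolist()  # first model output only
    return {"statusCode": 200, "body": json.dumps({"predictions": predictions})}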

When serverless ONNX inference fits

When this works

  • Request-driven inference for a small-to-medium model exported to ONNX (classifier, ranker, embedding head)
  • scikit-learn or other CPU models you want behind an authenticated HTTP route without running a model server
  • Bursty or low-volume scoring where paying for an always-on GPU or container is wasteful

When to skip it

  • Large GPU-bound models or low-latency, high-throughput serving: there is no GPU, and the first request in each fresh container pays the model load

FAQ

Which Python version and base image runs the model?

Python 3.12 on glibc (Debian slim). Because the base is glibc, the manylinux wheels that onnxruntime, onnx, and scikit-learn publish install and load as-is—no source builds and no Alpine/musl wheel breakage.

How do I get onnxruntime into the function?

Attach the py-onnxruntime layer (with py-onnx or py-scikit-learn if you need them) to the function instead of bundling the wheel into every deploy. You can also pip-install via the layer build; pure-Python deps can live in the function bundle.

Where does the model file live, and when is it loaded?

Ship the .onnx file inside the function bundle—it deploys next to the handler in /var/task—or mount it as a layer. Build the InferenceSession at module scope so it loads once on cold start; warm containers in the pool reuse the loaded session across invocations.

Is there a GPU?

No. Inference runs on CPU (CPUExecutionProvider). This fits small-to-medium models and scikit-learn-style estimators. Large GPU-bound networks and training workloads are not a fit for this platform.

What are the memory and timeout limits?

Memory defaults to 256MB and can be raised to 2GB for larger models or batches. The function timeout defaults to 5s and can be raised to 15 minutes. For bulk scoring that will not finish within a synchronous request, return 202 and run inference in a pipeline step (up to 15 minutes per step).
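A back-of-the-envelope check can help decide whether a batch fits the default memory or needs more. The sketch below counts only the float32 input tensor (4 bytes per value) plus the model file size. The headroom factor is a hypothetical safety margin, and the estimate ignores interpreter, runtime, and intermediate-tensor overhead, so treat it as a lower bound rather than a measurement.

```python
def input_tensor_bytes(rows: int, features: int, bytes_per_value: int = 4) -> int:
    """Size of a dense input batch; float32 uses 4 bytes per value."""
    return rows * features * bytes_per_value


def fits(rows: int, features: int, model_bytes: int,
         limit_mb: int = 256, headroom: float = 0.5) -> bool:
    """Rough check that model file plus input batch stay within a budget.

    headroom is a hypothetical fraction of the limit reserved for the model
    and inputs; the rest is left for the interpreter and runtime overhead.
    """
    budget = limit_mb * 1024 * 1024 * headroom
    return model_bytes + input_tensor_bytes(rows, features) <= budget


if __name__ == "__main__":
    model_size = 50 * 1024 * 1024  # a 50 MiB model file, as an example
    print(input_tensor_bytes(10_000, 128))                     # 5120000
    print(fits(10_000, 128, model_size))                       # True
    print(fits(1_000_000, 128, model_size))                    # False
    print(fits(1_000_000, 128, model_size, limit_mb=2048))     # True
```

For example, 10,000 rows of 128 float32 features occupy 10,000 × 128 × 4 = 5,120,000 bytes, which is small next to a 50 MiB model, while a million such rows would need the memory limit raised or the work split into smaller batches.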