Serverless ONNX model inference on Python 3.12
Run a trained model behind an HTTP route without standing up a model server. The Python runtime is Python 3.12 on glibc (Debian slim), so manylinux wheels install cleanly and onnxruntime, onnx, and scikit-learn are available as attachable layers. Ship the model file inside the function (it lives next to the handler in /var/task), load it once on cold start, and reuse it across warm invocations. This is CPU inference: there is no GPU.
Last updated: 2026-06-25
- Python 3.12 on glibc (Debian slim): manylinux wheels work, no edge-isolate limits
- Attachable layers: py-onnxruntime, py-onnx, py-scikit-learn
- Model ships in /var/task next to the handler—loaded once on cold start, reused warm
- 256MB default memory (up to 2GB); 5s default timeout, 15min max
Answer first
Direct answer
Inquir runs Python 3.12 in containers on glibc (Debian slim), so manylinux wheels install the way they do on a normal Linux box. onnxruntime, onnx, and scikit-learn are available as attachable layers (py-onnxruntime, py-onnx, py-scikit-learn): attach the layer to the function instead of bundling the wheel into every deploy.
When it fits
- Request-driven inference for a small-to-medium model exported to ONNX (classifier, ranker, embedding head)
- scikit-learn or other CPU models you want behind an authenticated HTTP route without running a model server
- Bursty or low-volume scoring where paying for an always-on GPU or container is wasteful
Trade-offs
- Edge and V8-isolate platforms do not run native ML wheels at all: onnxruntime ships a compiled binary, and isolates cannot load it. So inference that depends on a real CPython process is off the table on those runtimes regardless of how small the model is.
- Even on platforms that run Python, a musl-based (Alpine) image breaks the prebuilt manylinux wheels that onnxruntime and scikit-learn publish, forcing slow source builds or pinned older versions. A glibc base is what lets those wheels install as published.
Workload and what breaks
What it costs to serve one trained model
You have a trained model—a scikit-learn classifier exported to ONNX, a small gradient-boosted ranker, an embedding head. Serving it usually means running a container or a model server 24/7, sizing it for peak traffic, and paying for idle time between requests. For request-driven inference that is mostly bursty, that is a lot of standing infrastructure for a function that loads a file and calls session.run().
The model itself is small and the inference code is short. What is missing is a place to run it on demand: install the runtime, keep the model in memory between calls, expose it behind an authenticated HTTP route, and not pay when nothing is calling it.
Trade-offs
Why edge and minimal runtimes block model inference
Edge and V8-isolate platforms do not run native ML wheels at all—onnxruntime ships a compiled binary, and isolates cannot load it. So inference that depends on a real CPython process is off the table on those runtimes regardless of how small the model is.
Even on platforms that run Python, a musl-based (Alpine) image breaks the prebuilt manylinux wheels that onnxruntime and scikit-learn publish, forcing slow source builds or pinned older versions. A glibc base is what lets those wheels install as published.
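A quick way to see which side of the glibc/musl split a given runtime falls on is to ask the interpreter. The sketch below uses only the standard library; `describe_runtime` is a hypothetical helper name, and `platform.libc_ver()` is a best-effort probe that typically reports an empty name on non-glibc systems, so treat the result as a hint rather than proof.

```python
import platform
import sys


def describe_runtime() -> dict:
    """Report the Python version and the C library the interpreter appears to use."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "libc": libc_name or "unknown",  # empty name means the probe found no glibc
        "libc_version": libc_version,
        "is_glibc": libc_name == "glibc",
    }


if __name__ == "__main__":
    print(describe_runtime())
```

If `is_glibc` comes back false, published manylinux wheels are unlikely to load in that environment.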
How Inquir helps
A glibc Python runtime with attachable ML layers
Inquir runs Python 3.12 in containers on glibc (Debian slim), so manylinux wheels install the way they do on a normal Linux box. onnxruntime, onnx, and scikit-learn are available as attachable layers (py-onnxruntime, py-onnx, py-scikit-learn): attach the layer to the function instead of bundling the wheel into every deploy.
The model file ships inside the function bundle and lives next to the handler in /var/task, or you can mount it as a layer. Load it once at module scope, on cold start, and the warm container pool (min 1, up to 8 per function) reuses the loaded session across invocations, so steady traffic does not reload the model on each call. Inference runs on CPU: there is no GPU on the platform, so this fits small-to-medium models, not large GPU-bound networks.
What you get
What serverless ONNX inference covers
Attach the runtime as a layer
Add py-onnxruntime (and py-onnx or py-scikit-learn as needed) to the function instead of bundling the wheel into every deploy. The glibc base means the published manylinux wheel loads as-is.
Model loaded once on cold start
Build the InferenceSession at module scope. The first invocation in a fresh container pays the load; warm containers in the pool reuse the session, so most requests skip model loading entirely.
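To make the difference concrete, here is a minimal sketch that counts how often the load step runs. `load_session` is a hypothetical stand-in for building an `InferenceSession` (it returns a trivial scoring function), so the example runs without a model file.

```python
LOADS = {"count": 0}  # how many times the expensive load step has run


def load_session():
    """Stand-in for the real model load; returns a trivial scoring callable."""
    LOADS["count"] += 1
    return lambda rows: [sum(row) for row in rows]


# Module scope: runs once when the container imports this file (cold start).
_session = load_session()


def handler(event, context=None):
    """Reuses the session built at import time."""
    return {"predictions": _session(event["features"])}


def handler_reloading(event, context=None):
    """Anti-pattern for comparison: pays the load cost on every request."""
    session = load_session()
    return {"predictions": session(event["features"])}
```

Calling `handler` any number of times leaves `LOADS["count"]` unchanged, while each call to `handler_reloading` increments it.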
Inference behind an authenticated route
Wire the handler to a gateway route with api-key or bearer auth. Callers POST input features as JSON; the handler returns predictions. Memory defaults to 256MB and can be raised to 2GB for larger models or batches.
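From the caller's side, a request might look like the sketch below. The route URL, the `Authorization: Bearer` header format, and the assumption that the gateway returns the handler's JSON body as the HTTP response are all illustrative; check the route's actual auth configuration before relying on them.

```python
import json
import urllib.request


def build_predict_request(url: str, token: str, features: list) -> urllib.request.Request:
    """Build a POST request carrying the input features as JSON."""
    payload = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumption: bearer-token scheme
        },
    )


def predict(url: str, token: str, features: list, timeout: float = 5.0) -> list:
    """Send the request and return the predictions list from the JSON response."""
    request = build_predict_request(url, token, features)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]
```

Splitting request construction from sending keeps the payload and headers easy to inspect without touching the network.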
Batch or async inference
For a single prediction, return synchronously inside the 5s default timeout. For bulk scoring that runs longer, accept the request, return 202, and run inference in a pipeline step (up to 15 minutes per step).
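One way to sketch that split is a handler that scores small requests inline and hands larger ones off. The row threshold, the `job_id` field, and the `score` and `enqueue` callables are hypothetical; `enqueue` stands in for whatever starts the pipeline step on your setup.

```python
import json
import uuid

SYNC_ROW_LIMIT = 64  # hypothetical cutoff; tune it against the function timeout


def make_handler(score, enqueue, sync_row_limit=SYNC_ROW_LIMIT):
    """Build a handler that answers small requests inline and defers large ones."""

    def handler(event, context=None):
        body = json.loads(event.get("body") or "{}")
        rows = body.get("features") or []
        if not rows:
            return {"statusCode": 400, "body": json.dumps({"error": "features required"})}

        if len(rows) <= sync_row_limit:
            # Small enough to score within the synchronous timeout.
            return {"statusCode": 200, "body": json.dumps({"predictions": score(rows)})}

        # Too large: accept the work, hand it off, and return immediately.
        job_id = str(uuid.uuid4())
        enqueue(job_id, rows)
        return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}

    return handler
```

Passing `score` and `enqueue` in as arguments keeps the routing decision testable without a model or a pipeline.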
What to do next
How to serve an ONNX model as a function
Export the model to ONNX, ship it with the handler, attach the runtime layer, and expose it behind a route.
Export and ship the model
Export your trained model to ONNX (e.g. with skl2onnx for a scikit-learn estimator). Place the .onnx file in the function bundle so it deploys to /var/task next to the handler.
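As a rough sketch of the export step, the helper below converts a fitted scikit-learn estimator with `skl2onnx.to_onnx` and writes the result to disk. The helper name and the default output path are illustrative, and it assumes numpy and skl2onnx are installed in the environment where you run the export (not necessarily in the function itself).

```python
from pathlib import Path


def export_sklearn_to_onnx(model, sample_rows, out_path="model.onnx"):
    """Convert a fitted scikit-learn estimator to ONNX and write it to out_path."""
    # Imported here so that only the export step needs the conversion toolchain.
    import numpy as np
    from skl2onnx import to_onnx

    # to_onnx infers the input type and width from a small float32 sample.
    sample = np.asarray(sample_rows, dtype=np.float32)
    onnx_model = to_onnx(model, sample[:1])

    out = Path(out_path)
    out.write_bytes(onnx_model.SerializeToString())
    return out
```

For example, after fitting a classifier `clf` on training data `X`, calling `export_sklearn_to_onnx(clf, X)` would produce a `model.onnx` file to place next to the handler. Casting the sample to float32 matches the dtype the handler uses when it builds its input array.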
Attach the runtime and load once
Attach the py-onnxruntime layer. Build the InferenceSession at module scope, pointing at the model path in /var/task, so it loads on cold start and stays resident in warm containers.
Expose and observe
Connect the handler to a gateway route with api-key auth. Each invocation produces a run record—duration, status, logs—so you can watch cold-start vs warm latency in execution history.
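If you want the cold/warm distinction to show up explicitly in the logs, one option is to tag each invocation yourself. This sketch assumes that lines printed to stdout end up in the run record's logs; the `timed` decorator and the log field names are hypothetical.

```python
import functools
import json
import time

_cold = True  # True until the first invocation in this process has started


def timed(fn):
    """Log whether each invocation was the first in this process, and how long it took."""

    @functools.wraps(fn)
    def wrapper(event, context=None):
        global _cold
        was_cold, _cold = _cold, False
        start = time.perf_counter()
        try:
            return fn(event, context)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(json.dumps({"cold_start": was_cold, "duration_ms": round(elapsed_ms, 3)}))

    return wrapper


@timed
def handler(event, context=None):
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```

The measured duration covers only the handler body. Model loading at module scope happens before the first call, so it is not included in `duration_ms`.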
Code example
ONNX inference handler: load once, run per request
The model is loaded at import time (cold start) and reused by every warm invocation. The handler reads features from the request body and returns predictions. The model file lives next to this handler in /var/task.
```text
# Python 3.12 on glibc (Debian slim): manylinux wheels install as published.
# Prefer attaching the shared layers (py-onnxruntime, py-numpy) over bundling.
onnxruntime
numpy
```

```python
import os
import json

import numpy as np
import onnxruntime as ort

# Loaded ONCE on cold start, at module scope.
# The model ships inside the function and lives next to this handler in /var/task.
MODEL_PATH = os.path.join(os.path.dirname(__file__), "model.onnx")
_session = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])  # CPU only, no GPU
_input_name = _session.get_inputs()[0].name


def handler(event, context):
    body = json.loads(event.get("body") or "{}")
    features = body.get("features")
    if not features:
        return {"statusCode": 400, "body": json.dumps({"error": "features required"})}

    x = np.asarray(features, dtype=np.float32)
    if x.ndim == 1:
        x = x.reshape(1, -1)  # single row -> batch of 1

    # Warm containers reuse _session, so there is no per-request model load.
    outputs = _session.run(None, {_input_name: x})
    predictions = outputs[0].tolist()
    return {"statusCode": 200, "body": json.dumps({"predictions": predictions})}
```
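The handler above passes the request body straight to `np.asarray`, so ragged rows or non-numeric values would surface as an exception rather than a clean 400. A small validation step could run first. The sketch below is one possible shape for it: `normalize_features` is a hypothetical helper, and `n_features` is assumed to be known from your model (for instance, read from the session's input shape).

```python
def normalize_features(features, n_features):
    """Return features as a list of equal-length float rows, or raise ValueError."""
    if not isinstance(features, list) or not features:
        raise ValueError("features must be a non-empty list")

    # A flat list of numbers is treated as a single row.
    rows = features if isinstance(features[0], list) else [features]

    clean = []
    for i, row in enumerate(rows):
        if not isinstance(row, list) or len(row) != n_features:
            raise ValueError(f"row {i} must have {n_features} values")
        # bool is a subclass of int, so reject it explicitly.
        if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in row):
            raise ValueError(f"row {i} contains a non-numeric value")
        clean.append([float(v) for v in row])
    return clean
```

In the handler, catching `ValueError` from this call and returning a 400 with the message would keep malformed input from reaching the model.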
When it fits
When serverless ONNX inference fits
When this works
- Request-driven inference for a small-to-medium model exported to ONNX (classifier, ranker, embedding head)
- scikit-learn or other CPU models you want behind an authenticated HTTP route without running a model server
- Bursty or low-volume scoring where paying for an always-on GPU or container is wasteful
When to skip it
- Large GPU-bound models or low-latency, high-throughput serving: there is no GPU, and the first request in each fresh container pays the model load
FAQ
Which Python version and base image runs the model?
Python 3.12 on glibc (Debian slim). Because the base is glibc, the manylinux wheels that onnxruntime, onnx, and scikit-learn publish install and load as-is—no source builds and no Alpine/musl wheel breakage.
How do I get onnxruntime into the function?
Attach the py-onnxruntime layer (with py-onnx or py-scikit-learn if you need them) to the function instead of bundling the wheel into every deploy. You can also pip-install via the layer build; pure-Python deps can live in the function bundle.
Where does the model file live, and when is it loaded?
Ship the .onnx file inside the function bundle—it deploys next to the handler in /var/task—or mount it as a layer. Build the InferenceSession at module scope so it loads once on cold start; warm containers in the pool reuse the loaded session across invocations.
Is there a GPU?
No. Inference runs on CPU (CPUExecutionProvider). This fits small-to-medium models and scikit-learn-style estimators. Large GPU-bound networks and training workloads are not a fit for this platform.
What are the memory and timeout limits?
Memory defaults to 256MB and can be raised up to 2GB for larger models or batches. The function timeout defaults to 5s and can go up to 15 minutes. For bulk scoring that runs longer, return 202 and run inference in a pipeline step.