Why We Chose Rust Over Python for ML Inference

Not training — inference. How Rust's zero-cost abstractions, lack of GIL, and FFI story make it the better choice for shipping ML to production devices.

Glenn Sonna · June 26, 2026 · 7 min read

Tags: rust-ml, rust-vs-python, ml-inference, edge-ai, on-device-ai

“Just use Python.”

That’s the default advice for anything ML. And for training, it’s right — Python’s ecosystem (PyTorch, JAX, HuggingFace) is unmatched.

But we’re not training. We’re running inference on phones, laptops, and edge devices. And for that, Rust beats Python in every dimension that matters.

Here’s why we built Xybrid in Rust and haven’t looked back.

The Distinction: Training vs. Inference

	Training	Inference
Where	GPU cluster	User’s device
Who	ML engineers	Every user of your app
Priority	Flexibility, iteration speed	Latency, binary size, reliability
Frequency	Once (or periodically)	Millions of times
Environment	Controlled	Uncontrolled (any OS, any device)

Training and inference have fundamentally different requirements. Optimizing for one often hurts the other. Python excels at training’s requirements. Rust excels at inference’s.

Reason 1: No Runtime, No GIL, No Problem

Python inference means shipping a Python runtime with your app. On mobile, that’s a non-starter. On desktop, it’s a 30-50MB overhead plus dependency management nightmares.

Rust compiles to a single native binary. No runtime. No interpreter. No pip install on the user’s machine.

# Python inference deployment
pip install onnxruntime numpy tokenizers
# Hope the user has Python 3.9+
# Hope numpy doesn't conflict with their system version
# Hope the wheel exists for their platform

# Rust inference deployment
# Ship one binary. Done.

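The GIL (Global Interpreter Lock) is Python’s other gift. In standard CPython builds, only one thread executes Python bytecode at a time, so threads can’t parallelize CPU-bound Python code. Native extensions can release the GIL while they run, but any stage written in Python itself still serializes. For inference pipelines with multiple stages (preprocess → model → postprocess), the Python-level stages end up running one after another.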

Rust has no GIL. Pipeline stages can genuinely run in parallel:

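use std::thread;

// Preprocess the next input on a worker thread while the current model run proceeds
let preprocess_handle = thread::spawn(move || preprocess(next_input));
let output = model.run(&current_input)?;
// join() fails only if the worker panicked; its error type does not convert via `?`
let next_preprocessed = preprocess_handle.join().expect("preprocess thread panicked");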
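To make the overlap concrete, here is a minimal, self-contained sketch of the same idea using only the standard library. The `preprocess` and `run_model` functions are hypothetical stand-ins, not Xybrid's actual stages, and a real pipeline would more likely use a thread pool or channels than one thread per input.

```rust
use std::thread;

// Hypothetical stand-in for a real preprocessing stage: scale bytes to [0, 1].
fn preprocess(raw: Vec<u8>) -> Vec<f32> {
    raw.into_iter().map(|b| f32::from(b) / 255.0).collect()
}

// Hypothetical stand-in for a model call: sums the input values.
fn run_model(input: &[f32]) -> f32 {
    input.iter().sum()
}

/// Runs every input through preprocess -> model, overlapping the
/// preprocessing of input N+1 with the model run for input N.
fn run_pipeline(inputs: Vec<Vec<u8>>) -> Vec<f32> {
    let mut outputs = Vec::with_capacity(inputs.len());
    let mut iter = inputs.into_iter();

    // Preprocess the first input up front; nothing to overlap with yet.
    let mut current = match iter.next() {
        Some(first) => preprocess(first),
        None => return outputs,
    };

    for next_raw in iter {
        // Worker thread preprocesses the next input...
        let handle = thread::spawn(move || preprocess(next_raw));
        // ...while this thread runs the model on the current one.
        outputs.push(run_model(&current));
        current = handle.join().expect("preprocess thread panicked");
    }

    // The last preprocessed input still needs its model run.
    outputs.push(run_model(&current));
    outputs
}
```

Calling `run_pipeline(vec![vec![255, 255], vec![0]])` returns one model output per input, in input order.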

Reason 2: FFI Is a First-Class Citizen

We need to call our inference engine from Flutter (Dart), Swift, Kotlin, and C# (Unity). If the engine were written in Python, the options for exposing it to those hosts would be:

PyO3: lets a Rust host embed a Python interpreter. Now you’re shipping Python anyway.
ctypes/cffi: these work, but only with Python as the host calling into native code. They don’t let Swift call into Python.
Subprocess: shell out to a Python script. Process startup and serialization overhead kill latency.

From Rust, FFI is natural:

Target	Tool	Lines of Binding Code
Dart (Flutter)	flutter_rust_bridge	~200
Swift	UniFFI	~150
Kotlin	UniFFI	~150
C# (Unity)	cbindgen (C FFI)	~100

Rust produces .so, .dylib, .dll, and .a files that any language can load. The type system maps cleanly to C, and tools like UniFFI generate idiomatic wrappers automatically.

Our entire Flutter SDK is a thin Dart layer over Rust. The inference, preprocessing, caching — all Rust, called via FFI. Zero Python in the dependency tree.
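For a sense of what sits underneath those binding generators, here is a minimal sketch of a C-ABI export. The function name `demo_text_len` is hypothetical and not part of Xybrid's API; it only illustrates the kind of plain C surface that tools like cbindgen describe in a header.

```rust
use std::ffi::{c_char, CStr};

/// Hypothetical exported function (not part of any real SDK): returns the
/// byte length of a NUL-terminated UTF-8 string, or -1 for a null pointer
/// or invalid UTF-8.
///
/// # Safety
/// `text` must be null or point to a valid NUL-terminated C string.
// Rust 1.82+ spelling; older toolchains write `#[no_mangle]` instead.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn demo_text_len(text: *const c_char) -> i64 {
    if text.is_null() {
        return -1;
    }
    // SAFETY: the caller guarantees a valid NUL-terminated string.
    let c_str = unsafe { CStr::from_ptr(text) };
    match c_str.to_str() {
        Ok(s) => s.len() as i64,
        Err(_) => -1,
    }
}
```

Built as a `cdylib` or `staticlib`, this is visible to any host that can call C, under the declaration `int64_t demo_text_len(const char *text);`. Errors cross the boundary as sentinel values here because Rust's `Result` has no C representation.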

Reason 3: Memory Safety Without Garbage Collection

ML inference processes large tensors. An 82M-parameter model produces multi-megabyte intermediate buffers. In Python (via numpy), memory management is opaque: you rely on reference counting and the garbage collector to decide when a buffer goes away.

Rust’s ownership model gives you:

Deterministic deallocation: buffers are freed exactly when their owner goes out of scope, not “sometime later when the GC runs”
No use-after-free: the compiler rejects dangling references to deallocated tensors
Zero-copy where possible: pass slices instead of copying data between pipeline stages

// Tensor data freed exactly when executor scope ends
{
    let mut executor = TemplateExecutor::new();
    let output = executor.execute(&metadata, &input)?;
    // Process output...
} // All intermediate buffers freed here. Deterministic.

On memory-constrained devices (phones, embedded), this predictability matters. No GC pauses during inference. No memory spikes from deferred collection.
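The scope-based freeing described above can be observed directly. This is a small sketch under stated assumptions: `TrackedBuffer` is a hypothetical type invented for the demonstration, and the counter exists only to make the drop visible.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical buffer type that counts how many instances are alive.
struct TrackedBuffer {
    data: Vec<f32>,
    live: Rc<Cell<usize>>,
}

impl TrackedBuffer {
    fn new(len: usize, live: Rc<Cell<usize>>) -> Self {
        live.set(live.get() + 1);
        TrackedBuffer { data: vec![0.5; len], live }
    }
}

impl Drop for TrackedBuffer {
    // Runs at the exact point the owner goes out of scope.
    fn drop(&mut self) {
        self.live.set(self.live.get() - 1);
    }
}

/// Borrows a slice of the data: no copy is made.
fn mean(values: &[f32]) -> f32 {
    if values.is_empty() {
        0.0
    } else {
        values.iter().sum::<f32>() / values.len() as f32
    }
}

/// Returns (live buffers inside the scope, live buffers after it, mean).
fn scoped_inference() -> (usize, usize, f32) {
    let live = Rc::new(Cell::new(0));
    let (inside, result) = {
        let buffer = TrackedBuffer::new(1024, Rc::clone(&live));
        // Pass half the buffer to the next stage as a borrowed slice.
        let result = mean(&buffer.data[..512]);
        (live.get(), result)
    }; // `buffer` is dropped here, before the next line runs.
    (inside, live.get(), result)
}
```

The live count is 1 inside the block and 0 immediately after it, with no collector involved.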

Reason 4: Cross-Compilation Just Works

We target 6 platforms from one codebase. The five native targets build like this (WASM is the sixth):

# macOS (host)
cargo build

# iOS
cargo build --target aarch64-apple-ios

# Android
cargo build --target aarch64-linux-android

# Linux
cargo build --target x86_64-unknown-linux-gnu

# Windows
cargo build --target x86_64-pc-windows-msvc

Rust’s cross-compilation story is mature. The cc crate handles C/C++ dependencies (like ONNX Runtime and llama.cpp) across toolchains. Conditional compilation with #[cfg] attributes lets us select platform-specific code at build time:

#[cfg(target_os = "macos")]
fn setup_accelerator() {
    // CoreML + Metal
}

#[cfg(target_os = "android")]
fn setup_accelerator() {
    // CPU + NNAPI
}
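One detail worth spelling out: with per-OS definitions like these, every target needs exactly one matching function, or the build fails. A minimal sketch with a fallback arm looks like this. The `accelerator_name` helper and its return strings are hypothetical, chosen only to mirror the two cases above.

```rust
/// Hypothetical helper: names the backend selected at compile time.
#[cfg(target_os = "macos")]
fn accelerator_name() -> &'static str {
    "coreml+metal"
}

#[cfg(target_os = "android")]
fn accelerator_name() -> &'static str {
    "cpu+nnapi"
}

// Fallback so every other target still gets exactly one definition.
#[cfg(not(any(target_os = "macos", target_os = "android")))]
fn accelerator_name() -> &'static str {
    "cpu"
}
```

Only one of the three functions exists in any given build, so the choice costs nothing at runtime.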

Try cross-compiling a Python inference script with numpy, ONNX Runtime, and a custom tokenizer to iOS. I’ll wait.

Reason 5: Error Handling That Doesn’t Lie

Python’s approach to errors in inference code:

try:
    output = session.run(None, inputs)
except Exception as e:
    print(f"Something went wrong: {e}")
    # Good luck debugging which of 15 possible failure modes this is

Rust forces you to handle every failure mode explicitly:

use std::path::PathBuf;

// One variant per failure mode, each carrying the context needed to debug it
#[derive(Debug)]
pub enum ExecutionError {
    ModelNotFound(PathBuf),
    PreprocessingFailed { step: String, reason: String },
    InferenceFailed { model: String, error: ort::Error },
    PostprocessingFailed { step: String, reason: String },
    UnsupportedFormat(String),
}

Fallible functions return Result<T, E>. You can’t reach the success value without handling or explicitly propagating the error, and the compiler warns when a Result is silently dropped. No silent failures, no “it returned None but I don’t know why.”

For production inference, this is invaluable. When a model fails on a user’s device, the error message tells you exactly what went wrong.
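Here is a minimal sketch of how that plays out with the `?` operator, using only the standard library. `PipelineError`, `locate_model`, `normalize`, and `prepare` are hypothetical names invented for this example, and the "known models" list stands in for a real filesystem check.

```rust
use std::fmt;
use std::path::PathBuf;

/// Hypothetical, trimmed-down error type in the style shown above.
#[derive(Debug, PartialEq)]
enum PipelineError {
    ModelNotFound(PathBuf),
    PreprocessingFailed { step: String, reason: String },
}

impl fmt::Display for PipelineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PipelineError::ModelNotFound(path) => {
                write!(f, "model not found: {}", path.display())
            }
            PipelineError::PreprocessingFailed { step, reason } => {
                write!(f, "preprocessing step '{}' failed: {}", step, reason)
            }
        }
    }
}

impl std::error::Error for PipelineError {}

// Stand-in for a filesystem lookup: checks a list of known model names.
fn locate_model(path: &str, known: &[&str]) -> Result<PathBuf, PipelineError> {
    if known.contains(&path) {
        Ok(PathBuf::from(path))
    } else {
        Err(PipelineError::ModelNotFound(PathBuf::from(path)))
    }
}

// Scales samples so the largest magnitude becomes 1.0.
fn normalize(samples: &[f32]) -> Result<Vec<f32>, PipelineError> {
    let fail = |reason: &str| PipelineError::PreprocessingFailed {
        step: "normalize".to_string(),
        reason: reason.to_string(),
    };
    if samples.is_empty() {
        return Err(fail("empty input"));
    }
    let peak = samples.iter().fold(0.0_f32, |m, s| m.max(s.abs()));
    if peak == 0.0 {
        return Err(fail("all samples are zero"));
    }
    Ok(samples.iter().map(|s| s / peak).collect())
}

/// Each `?` either yields the success value or returns the typed error.
fn prepare(
    path: &str,
    known: &[&str],
    samples: &[f32],
) -> Result<(PathBuf, Vec<f32>), PipelineError> {
    let model = locate_model(path, known)?;
    let input = normalize(samples)?;
    Ok((model, input))
}
```

A caller can `match` on the returned variant to decide whether to retry, fall back, or report, and the `Display` output names the failing step rather than a generic exception.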

Reason 6: Zero-Cost Abstractions

Our Envelope type (which carries data between pipeline stages) uses an enum:

pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}

Matching on this enum compiles down to a check of the variant tag, often a jump table: no vtable lookup, no dynamic dispatch, and no allocation beyond the payloads the variants already own. The abstraction itself costs essentially nothing at runtime.
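As a sketch of what consuming that enum looks like, the snippet below repeats the `EnvelopeKind` definition so it stands alone and adds a hypothetical `describe` function (not from the Xybrid codebase) that dispatches on the variant.

```rust
pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}

/// Hypothetical consumer: dispatches on the variant tag with a plain match.
pub fn describe(kind: &EnvelopeKind) -> String {
    match kind {
        EnvelopeKind::Audio(bytes) => format!("audio: {} bytes", bytes.len()),
        EnvelopeKind::Text(text) => format!("text: {} bytes", text.len()),
        EnvelopeKind::Embedding(values) => format!("embedding: {} dims", values.len()),
        EnvelopeKind::Tokens(ids) => format!("tokens: {} ids", ids.len()),
        EnvelopeKind::Tensor { data, shape } => {
            format!("tensor: {} values, shape {:?}", data.len(), shape)
        }
    }
}
```

The match has no wildcard arm, so adding a sixth variant later makes this function fail to compile until the new case is handled. That exhaustiveness check is part of what the enum buys beyond speed.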

In Python, the equivalent would be:

class Envelope:
    def __init__(self, kind, data):
        self.kind = kind  # heap-allocated string
        self.data = data  # heap-allocated, dynamically typed

Every attribute access typically goes through a dictionary lookup. Every type check is a runtime operation. For inference pipelines that run millions of times, these costs add up.

When Python Still Wins

Let’s be fair. Python is better when:

Prototyping — Jupyter notebooks for testing model behavior
Training — PyTorch/JAX ecosystem is unmatched
One-off scripts — quick data processing, model conversion
Team expertise — if your team knows Python and not Rust

We use Python for model conversion (ONNX export), testing (comparing outputs with reference implementations), and prototyping new preprocessing steps. Then we implement the production version in Rust.

The boundary is clear: Python for research, Rust for production.

The Numbers

Our Rust inference SDK:

Metric	Value
Core library size	~2MB stripped binary
Cold start (model load)	50-200ms depending on model
Inference overhead	<1ms above raw ONNX Runtime
Memory overhead	~5MB above model weights
Platforms	6 (macOS, iOS, Android, Linux, Windows, WASM)
Languages served	5 (Rust, Dart, Swift, Kotlin, C#)

Try shipping a Python inference runtime to iOS at 2MB.

TL;DR

	Python	Rust
Training	Best choice	Not practical
Inference	Works, but heavy	Ideal
Binary size	30-50MB+ runtime	2MB
Cross-platform	Painful	Native
FFI to mobile	Very hard	Built-in
Memory control	GC-dependent	Deterministic
Error handling	Runtime exceptions	Compiler-checked Result types
Concurrency	GIL-limited	True parallelism

Python is for building models. Rust is for shipping them.

See it in action: github.com/xybrid-ai/xybrid — one Rust core, five platform SDKs, zero Python in production.

Agree? Disagree? Drop your take in the comments.

On-Device AI: The Complete Guide — why on-device inference is viable and how to get started.
One Codebase, Five Platforms — how the Rust core powers Flutter, Swift, Kotlin, Unity, and CLI.
Edge AI vs Cloud AI — the decision framework for where to run each model.
