
Why We Chose Rust Over Python for ML Inference

Not training — inference. How Rust's zero-cost abstractions, lack of GIL, and FFI story make it the better choice for shipping ML to production devices.

Glenn Sonna
7 min read
Tags: rust-ml, rust-vs-python, ml-inference, edge-ai, on-device-ai

“Just use Python.”

That’s the default advice for anything ML. And for training, it’s right — Python’s ecosystem (PyTorch, JAX, HuggingFace) is unmatched.

But we’re not training. We’re running inference on phones, laptops, and edge devices. And for that, Rust beats Python in every dimension that matters.

Here’s why we built Xybrid in Rust and haven’t looked back.


The Distinction: Training vs. Inference

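|             | Training                        | Inference                          |
|-------------|---------------------------------|------------------------------------|
| Where       | GPU cluster                     | User’s device                      |
| Who         | ML engineers                    | Every user of your app             |
| Priority    | Flexibility, iteration speed    | Latency, binary size, reliability  |
| Frequency   | Once (or periodically)          | Millions of times                  |
| Environment | Controlled                      | Uncontrolled (any OS, any device)  |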

Training and inference have fundamentally different requirements. Optimizing for one often hurts the other. Python excels at training’s requirements. Rust excels at inference’s.

Reason 1: No Runtime, No GIL, No Problem

Python inference means shipping a Python runtime with your app. On mobile, that’s a non-starter. On desktop, it’s a 30-50MB overhead plus dependency management nightmares.

Rust compiles to a single native binary. No runtime. No interpreter. No pip install on the user’s machine.

# Python inference deployment
pip install onnxruntime numpy tokenizers
# Hope the user has Python 3.9+
# Hope numpy doesn't conflict with their system version
# Hope the wheel exists for their platform

# Rust inference deployment
# Ship one binary. Done.

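The GIL (Global Interpreter Lock) is Python’s other gift. Even with threading, standard CPython runs only one thread of Python bytecode at a time, so CPU-bound Python code doesn’t parallelize. Native extensions can release the GIL inside a call, but every pipeline stage written in Python itself (the glue around preprocess → model → postprocess) still executes sequentially.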

Rust has no GIL. Pipeline stages can genuinely run in parallel:

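use std::thread;

// Preprocess the next input while the current model runs
let preprocess_handle = thread::spawn(move || preprocess(next_input));
let output = model.run(&current_input)?;
// join() fails only if the worker panicked; its error type doesn't fit `?`
let next_preprocessed = preprocess_handle.join().expect("preprocess thread panicked");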
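
To make the overlap concrete, here is a minimal, self-contained sketch of the same idea using only the standard library. The `preprocess`, `run_model`, and `run_pipeline` functions are hypothetical stand-ins, not Xybrid's actual API; the only point is that input N+1 is preprocessed on a worker thread while the "model" runs on input N.

```rust
use std::thread;

// Hypothetical stand-in for a real preprocessing stage.
fn preprocess(raw: Vec<u8>) -> Vec<f32> {
    // Scale bytes into the 0.0..=1.0 range.
    raw.into_iter().map(|b| f32::from(b) / 255.0).collect()
}

// Hypothetical stand-in for inference: sums the features.
fn run_model(input: &[f32]) -> f32 {
    input.iter().sum()
}

/// Runs every input through the model, preprocessing input N+1 on a
/// worker thread while the model runs on input N.
fn run_pipeline(inputs: Vec<Vec<u8>>) -> Vec<f32> {
    let mut outputs = Vec::with_capacity(inputs.len());
    let mut raw_inputs = inputs.into_iter();

    // Prime the pipeline with the first preprocessed input.
    let mut current = raw_inputs.next().map(preprocess);

    while let Some(features) = current {
        // Start preprocessing the next input before running the model.
        let next_handle = raw_inputs
            .next()
            .map(|raw| thread::spawn(move || preprocess(raw)));

        outputs.push(run_model(&features));

        // join() returns Err only if the worker thread panicked.
        current = next_handle.map(|handle| handle.join().expect("preprocess thread panicked"));
    }
    outputs
}
```

Spawning one thread per input keeps the sketch short; a long-running pipeline would more likely keep a worker thread alive and feed it through a channel.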

Reason 2: FFI Is a First-Class Citizen

We need to call our inference engine from Flutter (Dart), Swift, Kotlin, and C# (Unity). With a Python core, the options for reaching those hosts are:

  • PyO3: Rust bindings for CPython. It lets Rust embed or extend a Python interpreter, so you’re shipping Python anyway.
  • ctypes/cffi: works, but only with Python as the host calling into C. It doesn’t let Swift call into Python.
  • Subprocess: shell out to a Python script. Process startup and serialization overhead kill latency.

From Rust, FFI is natural:

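| Target         | Tool                | Lines of Binding Code |
|----------------|---------------------|-----------------------|
| Dart (Flutter) | flutter_rust_bridge | ~200                  |
| Swift          | UniFFI              | ~150                  |
| Kotlin         | UniFFI              | ~150                  |
| C# (Unity)     | cbindgen (C FFI)    | ~100                  |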

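Rust produces .so, .dylib, and .dll shared libraries and .a static libraries that any language with a C FFI can load or link. Types marked #[repr(C)] and functions declared extern "C" map cleanly to the C ABI, and tools like UniFFI generate idiomatic wrappers automatically.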
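
As a rough illustration of what that C-compatible surface looks like, the sketch below exports one function and one struct. The names `Prediction` and `xy_argmax` are made up for this example and are not part of Xybrid; a real binding layer would normally be generated by the tools in the table rather than written by hand.

```rust
use std::slice;

/// Result struct with a C-compatible field layout.
#[repr(C)]
pub struct Prediction {
    pub class_index: u32,
    pub score: f32,
}

/// Hypothetical exported function: returns the index and value of the
/// largest score. A null pointer or zero length yields `u32::MAX` and NaN.
///
/// # Safety
/// `scores` must point to `len` readable, initialized `f32` values.
// In the 2024 edition this attribute is written `#[unsafe(no_mangle)]`.
#[no_mangle]
pub unsafe extern "C" fn xy_argmax(scores: *const f32, len: usize) -> Prediction {
    if scores.is_null() || len == 0 {
        return Prediction { class_index: u32::MAX, score: f32::NAN };
    }
    // View the caller's buffer as a slice without copying it.
    let scores = unsafe { slice::from_raw_parts(scores, len) };
    let mut best = 0usize;
    for (i, &score) in scores.iter().enumerate() {
        if score > scores[best] {
            best = i;
        }
    }
    Prediction { class_index: best as u32, score: scores[best] }
}
```

Built as a `cdylib` or `staticlib`, a function like this is visible to Swift, Kotlin, Dart, or C# as an ordinary C symbol taking a pointer and a length.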

Our entire Flutter SDK is a thin Dart layer over Rust. The inference, preprocessing, caching — all Rust, called via FFI. Zero Python in the dependency tree.

Reason 3: Memory Safety Without Garbage Collection

ML inference processes large tensors. An 82M-parameter model produces multi-megabyte intermediate buffers. In Python (via numpy), memory management is opaque: you rely on reference counting and the cycle-collecting GC to decide when those buffers are released.

Rust’s ownership model gives you:

  • Deterministic deallocation — buffers free exactly when the executor finishes, not “sometime later when the GC runs”
  • No use-after-free — the compiler prevents dangling references to deallocated tensors
  • Zero-copy where possible — pass slices instead of copying data between pipeline stages

// Tensor data freed exactly when executor scope ends
{
    let mut executor = TemplateExecutor::new();
    let output = executor.execute(&metadata, &input)?;
    // Process output...
} // All intermediate buffers freed here. Deterministic.

On memory-constrained devices (phones, embedded), this predictability matters. No GC pauses during inference. No memory spikes from deferred collection.
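
The scope-based release described above can be observed directly. In this sketch, `TrackedBuffer`, `mean`, and `run_scoped` are hypothetical names invented for illustration: the buffer bumps a counter when created and decrements it in `Drop`, so the counter shows that the buffer is gone the moment its scope closes. `mean` takes a slice, so the data is borrowed rather than copied.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical tensor buffer that tracks how many instances are alive.
struct TrackedBuffer {
    data: Vec<f32>,
    live: Rc<Cell<usize>>,
}

impl TrackedBuffer {
    fn new(len: usize, live: Rc<Cell<usize>>) -> Self {
        live.set(live.get() + 1);
        TrackedBuffer { data: vec![0.5; len], live }
    }
}

impl Drop for TrackedBuffer {
    fn drop(&mut self) {
        // Runs when the owning variable goes out of scope.
        self.live.set(self.live.get() - 1);
    }
}

/// Borrows the values as a slice; no copy is made.
fn mean(values: &[f32]) -> f32 {
    values.iter().sum::<f32>() / values.len() as f32
}

/// Returns (mean, live buffers inside the scope, live buffers after it).
fn run_scoped(len: usize) -> (f32, usize, usize) {
    let live = Rc::new(Cell::new(0));
    let (result, during) = {
        let intermediate = TrackedBuffer::new(len, Rc::clone(&live));
        let result = mean(&intermediate.data);
        let during = live.get();
        (result, during)
    }; // `intermediate` is dropped here, not at some later collection point
    (result, during, live.get())
}
```

Calling `run_scoped(4)` reports one live buffer inside the block and zero immediately after it.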

Reason 4: Cross-Compilation Just Works

We target 6 platforms from one codebase. The five native builds look like this (WASM is the sixth):

# macOS (host)
cargo build

# iOS
cargo build --target aarch64-apple-ios

# Android
cargo build --target aarch64-linux-android

# Linux
cargo build --target x86_64-unknown-linux-gnu

# Windows
cargo build --target x86_64-pc-windows-msvc

Rust’s cross-compilation story is mature. The cc crate handles C/C++ dependencies (like ONNX Runtime and llama.cpp) across toolchains. Conditional compilation with cfg attributes lets us include platform-specific code only on the targets that need it:

#[cfg(target_os = "macos")]
fn setup_accelerator() {
    // CoreML + Metal
}

#[cfg(target_os = "android")]
fn setup_accelerator() {
    // CPU + NNAPI
}
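
One detail the two snippets above leave out: with only those two definitions, the function does not exist on Linux or Windows and the build fails there. A common pattern, sketched here with hypothetical backend names and a made-up `accelerator_backends` function, is to add a `not(any(...))` fallback so exactly one definition is compiled on every target.

```rust
/// Hypothetical backend preference list; the names are illustrative only.
#[cfg(target_os = "macos")]
fn accelerator_backends() -> &'static [&'static str] {
    &["coreml", "metal", "cpu"]
}

#[cfg(target_os = "android")]
fn accelerator_backends() -> &'static [&'static str] {
    &["nnapi", "cpu"]
}

// Fallback so the crate still compiles on every other target.
#[cfg(not(any(target_os = "macos", target_os = "android")))]
fn accelerator_backends() -> &'static [&'static str] {
    &["cpu"]
}
```

Because the selection happens at compile time, code for the other platforms is not present in the shipped binary at all.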

Try cross-compiling a Python inference script with numpy, ONNX Runtime, and a custom tokenizer to iOS. I’ll wait.

Reason 5: Error Handling That Doesn’t Lie

Python’s approach to errors in inference code:

try:
    output = session.run(None, inputs)
except Exception as e:
    print(f"Something went wrong: {e}")
    # Good luck debugging which of 15 possible failure modes this is

Rust forces you to handle every failure mode explicitly:

use std::path::PathBuf;

// One variant per failure mode, each carrying the context needed to debug it
pub enum ExecutionError {
    ModelNotFound(PathBuf),
    PreprocessingFailed { step: String, reason: String },
    InferenceFailed { model: String, error: ort::Error },
    PostprocessingFailed { step: String, reason: String },
    UnsupportedFormat(String),
}

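Every function that can fail returns Result<T, E>. You can’t reach the T without dealing with the E, and Result is marked #[must_use], so the compiler flags any result you ignore. You either handle each error or explicitly propagate it with ?. No silent failures, no “it returned None but I don’t know why.”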

For production inference, this is invaluable. When a model fails on a user’s device, the error message tells you exactly what went wrong.
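
Here is a small, self-contained sketch of how such an error type is typically wired up. `PipelineError`, `normalize`, and `run` are simplified, hypothetical versions written for this example (the real `ExecutionError` above also wraps `ort::Error`, which is left out to avoid an external dependency). It shows a `Display` implementation for readable messages and `?` for propagation.

```rust
use std::fmt;
use std::path::{Path, PathBuf};

/// Simplified, hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum PipelineError {
    ModelNotFound(PathBuf),
    PreprocessingFailed { step: String, reason: String },
}

impl fmt::Display for PipelineError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            PipelineError::ModelNotFound(path) => {
                write!(f, "model not found: {}", path.display())
            }
            PipelineError::PreprocessingFailed { step, reason } => {
                write!(f, "preprocessing step '{}' failed: {}", step, reason)
            }
        }
    }
}

impl std::error::Error for PipelineError {}

/// Scales samples so the largest magnitude becomes 1.0.
fn normalize(samples: &[f32]) -> Result<Vec<f32>, PipelineError> {
    if samples.is_empty() {
        return Err(PipelineError::PreprocessingFailed {
            step: "normalize".to_string(),
            reason: "empty input".to_string(),
        });
    }
    let peak = samples.iter().fold(0.0f32, |peak, s| peak.max(s.abs()));
    if peak == 0.0 {
        // All-zero input: nothing to scale.
        return Ok(samples.to_vec());
    }
    Ok(samples.iter().map(|s| s / peak).collect())
}

/// Checks that the model file exists, then runs a placeholder "inference".
fn run(model_path: &str, samples: &[f32]) -> Result<f32, PipelineError> {
    let path = Path::new(model_path);
    if !path.exists() {
        return Err(PipelineError::ModelNotFound(path.to_path_buf()));
    }
    // `?` returns early with the preprocessing error, variant intact.
    let normalized = normalize(samples)?;
    Ok(normalized.iter().sum())
}
```

A caller can `match` on the returned variant to decide whether to retry, fall back to another model, or report the step and reason upstream.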

Reason 6: Zero-Cost Abstractions

Our Envelope type (which carries data between pipeline stages) uses an enum:

pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}

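Pattern matching on this enum compiles to a branch on the enum’s discriminant (typically a jump table or a handful of comparisons): no vtable lookup, no dynamic dispatch, and no allocation beyond the payload buffers themselves. The abstraction itself adds no runtime cost over hand-written tag checks.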
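
For a sense of what consuming this enum looks like, the sketch below repeats the definition and adds two helpers. `payload_bytes` and `tensor_is_consistent` are hypothetical functions written for this example, not part of the SDK.

```rust
use std::mem::size_of;

pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}

/// Hypothetical helper: size of the payload in bytes.
/// The match is exhaustive, so adding a variant without handling it
/// here is a compile error.
pub fn payload_bytes(kind: &EnvelopeKind) -> usize {
    match kind {
        EnvelopeKind::Audio(bytes) => bytes.len(),
        EnvelopeKind::Text(text) => text.len(),
        EnvelopeKind::Embedding(values) => values.len() * size_of::<f32>(),
        EnvelopeKind::Tokens(ids) => ids.len() * size_of::<i64>(),
        EnvelopeKind::Tensor { data, .. } => data.len() * size_of::<f32>(),
    }
}

/// Hypothetical helper: a tensor is consistent when the product of its
/// dimensions equals the number of elements. Other kinds pass trivially.
pub fn tensor_is_consistent(kind: &EnvelopeKind) -> bool {
    match kind {
        EnvelopeKind::Tensor { data, shape } => shape.iter().product::<usize>() == data.len(),
        _ => true,
    }
}
```

Both helpers borrow the envelope, so inspecting it copies none of the underlying audio, token, or tensor data.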

In Python, the equivalent would be:

class Envelope:
    def __init__(self, kind, data):
        self.kind = kind  # heap-allocated string
        self.data = data  # heap-allocated, dynamically typed

Every attribute access goes through a dictionary lookup. Every type check is a runtime operation. For inference pipelines that run millions of times, these costs add up.

When Python Still Wins

Let’s be fair. Python is better when:

  • Prototyping — Jupyter notebooks for testing model behavior
  • Training — PyTorch/JAX ecosystem is unmatched
  • One-off scripts — quick data processing, model conversion
  • Team expertise — if your team knows Python and not Rust

We use Python for model conversion (ONNX export), testing (comparing outputs with reference implementations), and prototyping new preprocessing steps. Then we implement the production version in Rust.

The boundary is clear: Python for research, Rust for production.

The Numbers

Our Rust inference SDK:

| Metric                  | Value                                            |
|-------------------------|--------------------------------------------------|
| Core library size       | ~2MB stripped binary                             |
| Cold start (model load) | 50-200ms depending on model                      |
| Inference overhead      | <1ms above raw ONNX Runtime                      |
| Memory overhead         | ~5MB above model weights                         |
| Platforms               | 6 (macOS, iOS, Android, Linux, Windows, WASM)    |
| Languages served        | 5 (Rust, Dart, Swift, Kotlin, C#)                |

Try shipping a Python inference runtime to iOS at 2MB.


TL;DR

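|                | Python             | Rust                    |
|----------------|--------------------|-------------------------|
| Training       | Best choice        | Not practical           |
| Inference      | Works, but heavy   | Ideal                   |
| Binary size    | 30-50MB+ runtime   | ~2MB                    |
| Cross-platform | Painful            | Native                  |
| FFI to mobile  | Very hard          | Built-in                |
| Memory control | GC-dependent       | Deterministic           |
| Error handling | Runtime exceptions | Compile-time guarantees |
| Concurrency    | GIL-limited        | True parallelism        |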

Python is for building models. Rust is for shipping them.


See it in action: github.com/xybrid-ai/xybrid — one Rust core, five platform SDKs, zero Python in production.


Agree? Disagree? Drop your take in the comments.

