“Just use Python.”
That’s the default advice for anything ML. And for training, it’s right — Python’s ecosystem (PyTorch, JAX, HuggingFace) is unmatched.
But we’re not training. We’re running inference on phones, laptops, and edge devices. And for that, Rust beats Python in every dimension that matters.
Here’s why we built Xybrid in Rust and haven’t looked back.
The Distinction: Training vs. Inference
| | Training | Inference |
|---|---|---|
| Where | GPU cluster | User’s device |
| Who | ML engineers | Every user of your app |
| Priority | Flexibility, iteration speed | Latency, binary size, reliability |
| Frequency | Once (or periodically) | Millions of times |
| Environment | Controlled | Uncontrolled (any OS, any device) |
Training and inference have fundamentally different requirements. Optimizing for one often hurts the other. Python excels at training’s requirements. Rust excels at inference’s.
Reason 1: No Runtime, No GIL, No Problem
Python inference means shipping a Python runtime with your app. On mobile, that’s a non-starter. On desktop, it’s a 30-50MB overhead plus dependency management nightmares.
Rust compiles to a single native binary. No runtime. No interpreter. No pip install on the user’s machine.
# Python inference deployment
pip install onnxruntime numpy tokenizers
# Hope the user has Python 3.9+
# Hope numpy doesn't conflict with their system version
# Hope the wheel exists for their platform
# Rust inference deployment
# Ship one binary. Done.
The GIL (Global Interpreter Lock) is Python’s other gift. In standard CPython builds, threads can’t execute Python bytecode in parallel, so CPU-bound Python code doesn’t scale across cores. Native libraries can release the GIL while they work, but the Python glue between pipeline stages (preprocess → model → postprocess) still tends to run sequentially.
Rust has no GIL. Pipeline stages can genuinely run in parallel:
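use std::thread;
// Preprocess the next input while the current model run is in flight
let preprocess_handle = thread::spawn(move || preprocess(next_input));
let output = model.run(&current_input)?;
// join() returns Err only if the worker thread panicked
let next_preprocessed = preprocess_handle.join().expect("preprocess thread panicked");
Reason 2: FFI Is a First-Class Citizen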
We need to call our inference engine from Flutter (Dart), Swift, Kotlin, and C# (Unity). From Python, your FFI options are:
- PyO3: Rust bindings for Python. Calling Python code from Rust this way means embedding a Python interpreter, so you’re shipping Python anyway.
- ctypes/cffi: these work, but only with Python as the host calling into native code. They don’t let Swift or Kotlin call into Python.
- Subprocess: shell out to a Python script. Serialization overhead kills latency.
From Rust, FFI is natural:
| Target | Tool | Lines of Binding Code |
|---|---|---|
| Dart (Flutter) | flutter_rust_bridge | ~200 |
| Swift | UniFFI | ~150 |
| Kotlin | UniFFI | ~150 |
| C# (Unity) | cbindgen (C FFI) | ~100 |
Rust produces .so, .dylib, .dll, and .a files that any language can load. The type system maps cleanly to C, and tools like UniFFI generate idiomatic wrappers automatically.
Our entire Flutter SDK is a thin Dart layer over Rust. The inference, preprocessing, caching — all Rust, called via FFI. Zero Python in the dependency tree.
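To make the C-ABI route concrete, here is a minimal sketch of an exported function of the kind a tool like cbindgen could describe in a header. The name `demo_mean` and its signature are hypothetical, not Xybrid’s API, and the attribute spelling assumes Rust 1.82 or newer (older toolchains write `#[no_mangle]`).

```rust
use std::slice;

/// Hypothetical C-ABI entry point (not part of Xybrid): writes the mean of
/// `len` floats starting at `data` into `out`.
/// Returns 0 on success and -1 on a null pointer or empty input.
///
/// # Safety
/// `data` must point to `len` readable f32 values and `out` to one writable f32.
#[unsafe(no_mangle)] // Rust 1.82+; older toolchains spell this #[no_mangle]
pub unsafe extern "C" fn demo_mean(data: *const f32, len: usize, out: *mut f32) -> i32 {
    // Report bad arguments with a status code instead of crashing the host app.
    if data.is_null() || out.is_null() || len == 0 {
        return -1;
    }
    // SAFETY: the caller promises `data` is valid for `len` reads.
    let values = unsafe { slice::from_raw_parts(data, len) };
    let sum: f32 = values.iter().sum();
    // SAFETY: the caller promises `out` is valid for one write.
    unsafe { *out = sum / len as f32 };
    0
}
```

Any language that can load a shared library and call a C function can use an export like this: the caller passes a float pointer, a length, and an output pointer, and checks the integer status. Returning a status code rather than panicking is a deliberate choice in this sketch, since unwinding across an FFI boundary is something to avoid.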
Reason 3: Memory Safety Without Garbage Collection
ML inference processes large tensors. An 82M-parameter model produces multi-megabyte intermediate buffers. In Python (via numpy), memory management is opaque: you rely on reference counting and the cyclic garbage collector.
Rust’s ownership model gives you:
- Deterministic deallocation — buffers free exactly when the executor finishes, not “sometime later when the GC runs”
- No use-after-free — the compiler prevents dangling references to deallocated tensors
- Zero-copy where possible — pass slices instead of copying data between pipeline stages
// Tensor data freed exactly when executor scope ends
{
    let mut executor = TemplateExecutor::new();
    let output = executor.execute(&metadata, &input)?;
    // Process output...
} // All intermediate buffers freed here. Deterministic.
On memory-constrained devices (phones, embedded), this predictability matters. No GC pauses during inference. No memory spikes from deferred collection.
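The scope-based freeing and slice borrowing described above can be observed directly. The following is a small sketch, assuming a hypothetical `ScratchBuffer` type that stands in for an intermediate tensor buffer; a shared counter records when `Drop` runs so the timing is visible.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for an intermediate tensor buffer (not a Xybrid type).
struct ScratchBuffer {
    data: Vec<f32>,
    freed: Rc<Cell<usize>>, // shared counter, incremented when a buffer is dropped
}

impl Drop for ScratchBuffer {
    fn drop(&mut self) {
        // Runs at a known point: when the owning variable goes out of scope.
        self.freed.set(self.freed.get() + 1);
    }
}

/// Borrows a slice of the caller's data, so nothing is copied.
fn peak(values: &[f32]) -> f32 {
    values.iter().fold(f32::NEG_INFINITY, |max, &v| max.max(v))
}

/// Allocates a scratch buffer, reads part of it through a slice, and returns.
fn run_stage(freed: &Rc<Cell<usize>>) -> f32 {
    let scratch = ScratchBuffer {
        data: vec![0.5, 2.0, 1.25],
        freed: Rc::clone(freed),
    };
    peak(&scratch.data[1..]) // sub-slice view into the same allocation
} // `scratch` is dropped here, before the caller sees the result
```

By the time `run_stage` returns, the counter has already been incremented, which is the property that makes peak memory predictable: the buffer’s lifetime is tied to the scope, not to a collector’s schedule.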
Reason 4: Cross-Compilation Just Works
We target 6 platforms from one codebase:
# macOS (host)
cargo build
# iOS
cargo build --target aarch64-apple-ios
# Android
cargo build --target aarch64-linux-android
# Linux
cargo build --target x86_64-unknown-linux-gnu
# Windows
cargo build --target x86_64-pc-windows-msvc
Rust’s cross-compilation story is mature. The cc crate handles C/C++ dependencies (like ONNX Runtime and llama.cpp) across toolchains. Conditional compilation with #[cfg] attributes lets us select platform-specific code at build time:
#[cfg(target_os = "macos")]
fn setup_accelerator() {
    // CoreML + Metal
}
#[cfg(target_os = "android")]
fn setup_accelerator() {
    // CPU + NNAPI
}
Try cross-compiling a Python inference script with numpy, ONNX Runtime, and a custom tokenizer to iOS. I’ll wait.
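One detail the `#[cfg]` snippet leaves implicit is what happens on targets that match neither attribute: the function would not exist and the build would fail. A minimal sketch of the same pattern with a fallback, using the hypothetical names `accelerator_name` and `is_mobile_target`:

```rust
/// Hypothetical helper: names the accelerator backend chosen at compile time.
#[cfg(target_os = "macos")]
pub fn accelerator_name() -> &'static str {
    "coreml"
}

#[cfg(target_os = "android")]
pub fn accelerator_name() -> &'static str {
    "nnapi"
}

// Fallback so the crate still compiles for every other target.
#[cfg(not(any(target_os = "macos", target_os = "android")))]
pub fn accelerator_name() -> &'static str {
    "cpu"
}

/// `cfg!` evaluates to a compile-time boolean, handy inside ordinary expressions.
pub fn is_mobile_target() -> bool {
    cfg!(any(target_os = "android", target_os = "ios"))
}
```

Exactly one `accelerator_name` definition survives compilation for any given target, so callers never branch on the platform at runtime.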
Reason 5: Error Handling That Doesn’t Lie
Python’s approach to errors in inference code:
try:
    output = session.run(None, inputs)
except Exception as e:
    print(f"Something went wrong: {e}")
    # Good luck debugging which of 15 possible failure modes this is
Rust forces you to handle every failure mode explicitly:
pub enum ExecutionError {
    ModelNotFound(PathBuf),
    PreprocessingFailed { step: String, reason: String },
    InferenceFailed { model: String, error: ort::Error },
    PostprocessingFailed { step: String, reason: String },
    UnsupportedFormat(String),
}
Every function that can fail returns Result<T, E>. The compiler ensures you handle (or explicitly propagate) every error. No silent failures, no “it returned None but I don’t know why.”
For production inference, this is invaluable. When a model fails on a user’s device, the error message tells you exactly what went wrong.
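As an illustration of how a typed error travels through a pipeline, here is a self-contained sketch. `DemoError`, `check_format`, `normalize`, and `run_pipeline` are hypothetical names chosen for this example; the enum is a cut-down cousin of `ExecutionError` without the `ort` dependency.

```rust
use std::fmt;

/// Hypothetical, reduced error type for illustration only.
#[derive(Debug, PartialEq)]
pub enum DemoError {
    PreprocessingFailed { step: String, reason: String },
    UnsupportedFormat(String),
}

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoError::PreprocessingFailed { step, reason } => {
                write!(f, "preprocessing step '{step}' failed: {reason}")
            }
            DemoError::UnsupportedFormat(path) => {
                write!(f, "unsupported model format: {path}")
            }
        }
    }
}

impl std::error::Error for DemoError {}

/// Accepts only paths ending in ".onnx" (an arbitrary rule for this sketch).
fn check_format(path: &str) -> Result<(), DemoError> {
    if path.ends_with(".onnx") {
        Ok(())
    } else {
        Err(DemoError::UnsupportedFormat(path.to_string()))
    }
}

/// Scales samples so the largest magnitude becomes 1.0.
fn normalize(samples: &[f32]) -> Result<Vec<f32>, DemoError> {
    if samples.is_empty() {
        return Err(DemoError::PreprocessingFailed {
            step: "normalize".to_string(),
            reason: "empty input".to_string(),
        });
    }
    let peak = samples.iter().fold(0.0f32, |max, v| max.max(v.abs()));
    if peak == 0.0 {
        return Ok(samples.to_vec()); // all zeros: nothing to scale
    }
    Ok(samples.iter().map(|v| v / peak).collect())
}

/// `?` forwards the first failure to the caller with its variant intact.
pub fn run_pipeline(path: &str, samples: &[f32]) -> Result<Vec<f32>, DemoError> {
    check_format(path)?;
    normalize(samples)
}
```

A caller can `match` on the returned variant to decide whether to retry, fall back, or report, and the `Display` text carries the failing step and reason into logs.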
Reason 6: Zero-Cost Abstractions
Our Envelope type (which carries data between pipeline stages) uses an enum:
pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}
Pattern matching on this enum typically compiles to a branch or jump table on the discriminant: the match itself involves no heap allocation, no vtable lookup, and no dynamic dispatch. The abstraction adds essentially nothing at runtime.
In Python, the equivalent would be:
class Envelope:
    def __init__(self, kind, data):
        self.kind = kind  # heap-allocated string
        self.data = data  # heap-allocated, dynamically typed
Every attribute access goes through a dictionary lookup. Every type check is a runtime operation. For inference pipelines that run millions of times, these costs add up.
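For comparison, a sketch of what dispatching on the Rust enum looks like in practice. The `describe` and `envelope_size_bytes` helpers are hypothetical additions for this example; the enum definition is repeated so the block stands alone.

```rust
use std::mem::size_of;

pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}

/// Hypothetical helper: returns a static label and the payload's element count.
/// The match is checked for exhaustiveness, so adding a variant without
/// handling it here is a compile error rather than a runtime surprise.
pub fn describe(kind: &EnvelopeKind) -> (&'static str, usize) {
    match kind {
        EnvelopeKind::Audio(bytes) => ("audio", bytes.len()),
        EnvelopeKind::Text(text) => ("text", text.len()),
        EnvelopeKind::Embedding(values) => ("embedding", values.len()),
        EnvelopeKind::Tokens(ids) => ("tokens", ids.len()),
        EnvelopeKind::Tensor { data, .. } => ("tensor", data.len()),
    }
}

/// The enum value has a fixed size known at compile time;
/// only the payloads (Vec, String) own heap memory.
pub fn envelope_size_bytes() -> usize {
    size_of::<EnvelopeKind>()
}
```

The payloads themselves are still heap-allocated vectors and strings; what the enum avoids is any per-access lookup or type check when deciding which kind of data is in hand.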
When Python Still Wins
Let’s be fair. Python is better when:
- Prototyping — Jupyter notebooks for testing model behavior
- Training — PyTorch/JAX ecosystem is unmatched
- One-off scripts — quick data processing, model conversion
- Team expertise — if your team knows Python and not Rust
We use Python for model conversion (ONNX export), testing (comparing outputs with reference implementations), and prototyping new preprocessing steps. Then we implement the production version in Rust.
The boundary is clear: Python for research, Rust for production.
The Numbers
Our Rust inference SDK:
| Metric | Value |
|---|---|
| Core library size | ~2MB stripped binary |
| Cold start (model load) | 50-200ms depending on model |
| Inference overhead | <1ms above raw ONNX Runtime |
| Memory overhead | ~5MB above model weights |
| Platforms | 6 (macOS, iOS, Android, Linux, Windows, WASM) |
| Languages served | 5 (Rust, Dart, Swift, Kotlin, C#) |
Try shipping a Python inference runtime to iOS at 2MB.
TL;DR
| | Python | Rust |
|---|---|---|
| Training | Best choice | Not practical |
| Inference | Works, but heavy | Ideal |
| Binary size | 30-50MB+ runtime | 2MB |
| Cross-platform | Painful | Native |
| FFI to mobile | Very hard | Built-in |
| Memory control | GC-dependent | Deterministic |
| Error handling | Runtime exceptions | Compile-time guarantees |
| Concurrency | GIL-limited | True parallelism |
Python is for building models. Rust is for shipping them.
See it in action: github.com/xybrid-ai/xybrid — one Rust core, five platform SDKs, zero Python in production.
Related
- On-Device AI: The Complete Guide — why on-device inference is viable and how to get started.
- One Codebase, Five Platforms — how the Rust core powers Flutter, Swift, Kotlin, Unity, and CLI.
- Edge AI vs Cloud AI — the decision framework for where to run each model.