
Building a Cross-Platform ML Inference SDK in Rust

How we built a single Rust core that powers ML inference across CLI, Flutter, Swift, Kotlin, and Unity — and the architectural decisions that made it possible.

Glenn Sonna

· June 12, 2026 · 7 min read

Tags: rust, ai, architecture, opensource

We set ourselves an ambitious goal: build one inference engine that runs on every major platform — desktops, phones, game engines, and the terminal. Not five separate implementations. One.

Eighteen months later, Xybrid ships ML inference to iOS, Android, macOS, Linux, and Windows, through Flutter, Swift, Kotlin, and Unity SDKs plus a CLI, all from the same Rust core.

Here’s how we designed it.

The Architecture

┌─────────────────────────────────────────────┐
│              Platform SDKs                   │
│  Flutter │ Swift │ Kotlin │ Unity │ CLI      │
├─────────────────────────────────────────────┤
│              xybrid-sdk (Rust)               │
│  Registry │ Cache │ Pipeline │ Telemetry     │
├─────────────────────────────────────────────┤
│              xybrid-core (Rust)              │
│  TemplateExecutor │ Envelope │ Preprocessing │
├─────────────────────────────────────────────┤
│              Runtime Backends                │
│  ONNX Runtime │ Candle │ llama.cpp           │
└─────────────────────────────────────────────┘

Every platform SDK is a thin binding over the same Rust code. The business logic — model loading, preprocessing, execution, postprocessing — lives in xybrid-core and xybrid-sdk. It’s written once, tested once, and deployed everywhere.

Decision 1: The Envelope Pattern

The first problem: how do you pass data between pipeline stages when the data type varies? ASR takes audio bytes, TTS takes text, embeddings produce float vectors.

We created the Envelope: a struct that pairs a tagged union (EnvelopeKind) with a metadata map, so any payload can travel through the system:

pub struct Envelope {
    pub kind: EnvelopeKind,
    pub metadata: HashMap<String, String>,
}

pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}

Every pipeline stage accepts an Envelope and returns an Envelope. This makes stages composable without knowing about each other:

Audio(wav_bytes) → [AudioDecode] → Tensor(samples) → [Model] → Tensor(logits) → [CTCDecode] → Text(transcript)

The metadata map carries side-channel info (sample rate, voice ID, message role) without polluting the type. It’s intentionally stringly-typed — pipeline stages are heterogeneous and we don’t want a combinatorial explosion of types.
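To make the composition idea concrete, here is a minimal sketch of how stages could chain over an Envelope. The Stage trait, the PcmDecode and Describe stages, and run_pipeline are hypothetical names invented for illustration; the post does not show Xybrid's real stage interface, and the derives on Envelope are added only so the example can be tested.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}

#[derive(Debug, Clone, PartialEq)]
pub struct Envelope {
    pub kind: EnvelopeKind,
    pub metadata: HashMap<String, String>,
}

/// Hypothetical stage interface (an assumption, not Xybrid's actual trait).
pub trait Stage {
    fn run(&self, input: Envelope) -> Result<Envelope, String>;
}

/// Toy stage: decodes 16-bit little-endian PCM bytes into samples in [-1, 1).
pub struct PcmDecode;

impl Stage for PcmDecode {
    fn run(&self, input: Envelope) -> Result<Envelope, String> {
        match input.kind {
            EnvelopeKind::Audio(bytes) => {
                let data: Vec<f32> = bytes
                    .chunks_exact(2)
                    .map(|b| i16::from_le_bytes([b[0], b[1]]) as f32 / 32768.0)
                    .collect();
                let shape = vec![data.len()];
                // Metadata is passed through untouched.
                Ok(Envelope {
                    kind: EnvelopeKind::Tensor { data, shape },
                    metadata: input.metadata,
                })
            }
            other => Err(format!("PcmDecode expected Audio, got {:?}", other)),
        }
    }
}

/// Toy stage standing in for a decoder: turns a tensor into a text summary.
pub struct Describe;

impl Stage for Describe {
    fn run(&self, input: Envelope) -> Result<Envelope, String> {
        match input.kind {
            EnvelopeKind::Tensor { data, .. } => Ok(Envelope {
                kind: EnvelopeKind::Text(format!("{} samples", data.len())),
                metadata: input.metadata,
            }),
            other => Err(format!("Describe expected Tensor, got {:?}", other)),
        }
    }
}

/// Runs stages in order, stopping at the first error.
pub fn run_pipeline(stages: &[Box<dyn Stage>], input: Envelope) -> Result<Envelope, String> {
    stages.iter().try_fold(input, |env, stage| stage.run(env))
}
```

Because each stage only sees an Envelope, a payload mismatch shows up as a runtime error from the stage rather than a compile error, which is the trade-off of keeping stage types uniform.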

Decision 2: model_metadata.json as the Contract

The hardest part of ML inference isn’t running the model. It’s everything around it: preprocessing inputs, configuring the session, and postprocessing outputs.

We solved this with a declarative metadata file that ships with every model:

{
  "model_id": "kokoro-82m",
  "execution_template": {
    "type": "SimpleMode",
    "model_file": "model.onnx"
  },
  "preprocessing": [
    { "type": "Phonemize", "backend": "MisakiDictionary", "tokens_file": "tokens.txt" }
  ],
  "postprocessing": [
    { "type": "TTSAudioEncode", "sample_rate": 24000 }
  ]
}

The TemplateExecutor reads this file and handles the full execution flow. No platform-specific inference code. The same metadata drives inference on iOS, Android, and your laptop.

This is the key architectural insight: the model knows how to run itself. The runtime just follows instructions.

let metadata: ModelMetadata = serde_json::from_str(&std::fs::read_to_string(path)?)?;
let mut executor = TemplateExecutor::with_base_path(model_dir);
let output = executor.execute(&metadata, &input)?;

Three lines. Works for TTS, ASR, classification, embeddings — any model type.
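For readers wondering what "the runtime just follows instructions" could look like inside, here is a simplified sketch of a metadata-driven executor. It assumes the JSON has already been deserialized into the hypothetical Metadata and Step types below; Data, apply, and execute are invented for illustration, the Phonemize step is a toy stand-in (it maps characters to code points), and the model is passed in as a closure rather than a real ONNX session.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Data {
    Text(String),
    Tokens(Vec<i64>),
    Samples(Vec<f32>),
    Audio { pcm: Vec<u8>, sample_rate: u32 },
}

/// Hypothetical, already-parsed form of the pre/postprocessing entries.
pub enum Step {
    Phonemize,
    TtsAudioEncode { sample_rate: u32 },
}

pub struct Metadata {
    pub preprocessing: Vec<Step>,
    pub postprocessing: Vec<Step>,
}

/// Applies one declarative step to the data flowing through the pipeline.
fn apply(step: &Step, data: Data) -> Result<Data, String> {
    match (step, data) {
        // Toy stand-in for phonemization: one token per character.
        (Step::Phonemize, Data::Text(text)) => {
            Ok(Data::Tokens(text.chars().map(|c| c as i64).collect()))
        }
        // Encode float samples as 16-bit little-endian PCM.
        (Step::TtsAudioEncode { sample_rate }, Data::Samples(samples)) => {
            let pcm = samples
                .iter()
                .flat_map(|&s| ((s.clamp(-1.0, 1.0) * 32767.0) as i16).to_le_bytes())
                .collect();
            Ok(Data::Audio { pcm, sample_rate: *sample_rate })
        }
        (_, other) => Err(format!("step cannot handle {:?}", other)),
    }
}

/// Preprocess, run the model, postprocess: the order comes from the metadata.
pub fn execute<F>(meta: &Metadata, input: Data, model: F) -> Result<Data, String>
where
    F: Fn(Data) -> Result<Data, String>,
{
    let mut data = input;
    for step in &meta.preprocessing {
        data = apply(step, data)?;
    }
    data = model(data)?;
    for step in &meta.postprocessing {
        data = apply(step, data)?;
    }
    Ok(data)
}
```

The point of the sketch is that adding a new model type means adding Step variants and their handlers, not writing a new execution path per platform.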

Decision 3: Three FFI Strategies

Rust is great for writing the core. But getting it into Dart, Swift, Kotlin, and C# requires FFI. We use three different approaches:

Platform	FFI Tool	Why
Flutter	flutter_rust_bridge (FRB)	Auto-generates Dart bindings, handles async, supports streaming callbacks
Swift & Kotlin	UniFFI	Mozilla’s tool, generates idiomatic bindings from a single UDL definition
Unity (C#)	C FFI + cbindgen	Unity needs raw C headers; cbindgen generates them from Rust

Why three? Because each ecosystem has different expectations:

Flutter needs async Futures and Streams. FRB handles this natively.
Swift expects async/await and value types. UniFFI maps Rust types to Swift idioms.
Kotlin expects suspending functions and data classes. UniFFI handles this too.
Unity needs DllImport with C calling conventions. Only C FFI works here.

The binding layer is thin by design. Here’s the entire Kotlin API for running a model:

val model = XybridModelLoader.fromRegistry("kokoro-82m")
val result = model.run(Envelope.text("Hello"))
// result.audioBytes() → play it

All the complexity is in Rust. The bindings are projections, not reimplementations.
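The C FFI path for Unity is the least automated of the three, so a small sketch may help. The two functions below are hypothetical (xybrid_echo_upper just upper-cases a string in place of running a model, and neither name comes from the Xybrid codebase); they show the usual shape of such an export: C strings in, a heap-allocated C string out, and a matching free function so the caller never releases Rust memory with the wrong allocator.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical export standing in for "run a model on text".
/// Returns null on a null pointer or invalid UTF-8.
/// The caller must release the result with `xybrid_string_free`.
///
/// Note: `#[unsafe(no_mangle)]` needs Rust 1.82 or newer; older
/// toolchains spell it `#[no_mangle]`.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn xybrid_echo_upper(input: *const c_char) -> *mut c_char {
    if input.is_null() {
        return std::ptr::null_mut();
    }
    // SAFETY: the caller promises `input` is a valid NUL-terminated string.
    let text = match unsafe { CStr::from_ptr(input) }.to_str() {
        Ok(s) => s,
        Err(_) => return std::ptr::null_mut(),
    };
    match CString::new(text.to_uppercase()) {
        // Ownership moves to the caller until xybrid_string_free is called.
        Ok(out) => out.into_raw(),
        Err(_) => std::ptr::null_mut(),
    }
}

/// Frees a string previously returned by `xybrid_echo_upper`.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn xybrid_string_free(ptr: *mut c_char) {
    if !ptr.is_null() {
        // SAFETY: `ptr` came from CString::into_raw and is freed only once.
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

cbindgen can turn signatures like these into a C header, and the C# side would declare them with DllImport. Everything richer than pointers and integers has to be flattened by hand, which is why this layer is worth keeping as small as possible.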

Decision 4: Feature Flags for Platform Presets

Not every platform supports every backend. Core ML runs only on Apple platforms. Metal is available only on Apple GPUs. On Android we can’t statically link ONNX Runtime (ORT), so it has to be loaded dynamically.

We use Cargo feature flags composed into platform presets:

[features]
platform-macos = ["ort-download", "ort-coreml", "candle-metal", "llm-llamacpp"]
platform-ios = ["ort-download", "ort-coreml", "candle-metal", "llm-llamacpp"]
platform-android = ["ort-dynamic", "candle", "llm-llamacpp"]
platform-desktop = ["ort-download", "llm-llamacpp"]

Invalid combinations are caught at compile time:

#[cfg(all(feature = "ort-download", feature = "ort-dynamic"))]
compile_error!("ort-download and ort-dynamic are mutually exclusive");

The build system (cargo xtask) auto-detects the target triple and applies the correct preset. Developers don’t think about feature flags — they just build for their target.
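As a rough illustration of that auto-detection, an xtask could map the target triple to a preset name with a few string checks. This is a sketch under assumptions: preset_for_target and cargo_build_args are hypothetical helpers, and the choice to treat Linux and Windows as platform-desktop is a guess based on the preset names above, not something the real xtask is known to do.

```rust
/// Maps a Rust target triple to one of the preset feature names.
/// Hypothetical helper; the matching rules are assumptions.
pub fn preset_for_target(triple: &str) -> Option<&'static str> {
    if triple.ends_with("-apple-ios") || triple.contains("-apple-ios-") {
        // Covers device triples and simulator triples such as aarch64-apple-ios-sim.
        Some("platform-ios")
    } else if triple.ends_with("-apple-darwin") {
        Some("platform-macos")
    } else if triple.contains("-linux-android") {
        // Must be checked before the generic Linux case below.
        Some("platform-android")
    } else if triple.contains("-linux-") || triple.contains("-windows-") {
        Some("platform-desktop")
    } else {
        None
    }
}

/// Builds the argument list for `cargo build` with the matching preset.
pub fn cargo_build_args(triple: &str) -> Option<Vec<String>> {
    let preset = preset_for_target(triple)?;
    Some(vec![
        "build".into(),
        "--target".into(),
        triple.into(),
        "--features".into(),
        preset.into(),
    ])
}
```

Returning None for unknown triples lets the caller fail loudly instead of silently building with a preset that does not fit the target.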

Decision 5: Always-Available Types

A subtle but important pattern: types that describe capabilities are available even when the capability is disabled.

// These compile without any feature flags:
pub struct GenerationConfig {
    pub max_tokens: usize,
    pub temperature: f32,
    pub top_p: f32,
}

pub struct ChatMessage {
    pub role: MessageRole,
    pub content: String,
}

Why? Because downstream code (Flutter bindings, SDK) needs to reference these types in function signatures even on platforms where LLM inference isn’t available. Without this, you’d need #[cfg] conditionals everywhere in the binding layer.

The types always exist. The implementations are gated behind features. Clean separation.
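A small sketch of the gating, assuming a hypothetical generate function and LlmError type (neither appears in the post): the config type is compiled unconditionally, while the function body is chosen by the llm-llamacpp feature. The enabled branch returns a placeholder string rather than calling a real backend.

```rust
/// Always compiled, regardless of feature flags.
#[derive(Debug, Clone, PartialEq)]
pub struct GenerationConfig {
    pub max_tokens: usize,
    pub temperature: f32,
    pub top_p: f32,
}

/// Hypothetical error type for illustration.
#[derive(Debug, PartialEq)]
pub enum LlmError {
    BackendDisabled,
}

/// With the feature on, this is where the real backend call would go.
#[cfg(feature = "llm-llamacpp")]
pub fn generate(prompt: &str, config: &GenerationConfig) -> Result<String, LlmError> {
    let _ = config; // placeholder: a real implementation would use the config
    Ok(format!("(placeholder output for: {})", prompt))
}

/// With the feature off, the same signature still exists and reports why it failed.
#[cfg(not(feature = "llm-llamacpp"))]
pub fn generate(_prompt: &str, _config: &GenerationConfig) -> Result<String, LlmError> {
    Err(LlmError::BackendDisabled)
}
```

Binding code can call generate and mention GenerationConfig in its signatures on every platform; only the outcome differs at runtime.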

The Hard Parts

ONNX Runtime on iOS

ORT doesn’t ship a nice iOS package. We vendor a pre-built xcframework in vendor/ort-ios/ and share it across every build path (xtask, Flutter, SPM) via a single resolve_ort_lib_location() function.

A symlink at bindings/flutter/ios/Frameworks/onnxruntime.xcframework keeps CocoaPods happy. It’s ugly, but it works on every CI machine without manual setup.

Candle on Android

Candle uses the gemm-f16 crate, which requires the ARM fp16 target feature. We had to add target-specific rustflags:

# .cargo/config.toml
[target.aarch64-linux-android]
rustflags = ["-C", "target-feature=+fp16"]

Runtime dispatch still works on older devices without FP16. The flag only affects compilation.

Mutable Whisper Decoding

ONNX models use session.run() with &self. But Whisper’s autoregressive decoding via Candle needs &mut self — each token generation modifies internal state.

Our solution: TemplateExecutor detects Candle models and handles mutability internally, while the public API stays immutable. Callers don’t know or care.
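One common way to get this shape in Rust is interior mutability: keep the stateful decoder behind a Mutex so the public method can take &self. The sketch below is an assumption about the general technique, not Xybrid's actual code; Decoder and Executor are hypothetical, and the "decoder" just emits a counter instead of real tokens.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for an autoregressive decoder with internal state.
struct Decoder {
    generated: Vec<u32>,
}

impl Decoder {
    /// Produces the next toy "token" and records it, mutating internal state.
    fn step(&mut self) -> u32 {
        let next = self.generated.len() as u32;
        self.generated.push(next);
        next
    }

    fn reset(&mut self) {
        self.generated.clear();
    }
}

/// Public type whose API only needs `&self`.
pub struct Executor {
    decoder: Mutex<Decoder>,
}

impl Executor {
    pub fn new() -> Self {
        Executor {
            decoder: Mutex::new(Decoder { generated: Vec::new() }),
        }
    }

    /// Callers hold a shared reference; the lock provides exclusive access inside.
    pub fn execute(&self, steps: usize) -> Vec<u32> {
        let mut decoder = self.decoder.lock().expect("decoder mutex poisoned");
        decoder.reset();
        (0..steps).map(|_| decoder.step()).collect()
    }
}
```

A Mutex also keeps the type Sync, which matters when bindings call in from several threads; the cost is that concurrent calls on the same executor are serialized.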

What We’d Do Differently

Start with UniFFI for all native bindings. We built the C FFI layer first, then realized UniFFI gives us Swift + Kotlin for free. The C layer is only still needed for Unity.
Define the API contract earlier. We now have an api-surface.yaml that defines the public API across all SDKs. Adding it from day one would have prevented drift between platforms.
Vendor fewer things. ORT iOS vendoring was necessary, but every vendored dependency is a maintenance burden. We’d push harder for upstream packages.

Results

From a single Rust codebase:

6 platforms shipping (CLI, Flutter, Swift, Kotlin, Unity, native Rust)
3 inference backends (ONNX Runtime, Candle, llama.cpp)
58 unit tests + 7 doctests in core, all passing
One model_metadata.json per model, works everywhere

The Rust core is ~15K lines. Each binding layer is under 1K lines. That’s the power of putting the logic in one place.

Xybrid is open-source: github.com/xybrid-ai/xybrid

If you’re building cross-platform AI features and tired of maintaining separate inference code per platform — check it out. We’d love your feedback.

Have questions about the architecture? Drop them in the comments — happy to go deeper on any of these decisions.
