We set ourselves an ambitious goal: build one inference engine that runs on every major platform — desktops, phones, game engines, and the terminal. Not five separate implementations. One.
Eighteen months later, Xybrid ships ML inference to iOS, Android, macOS, Linux, Windows, Flutter, Swift, Kotlin, Unity, and the CLI, all from the same Rust core.
Here’s how we designed it.
The Architecture
┌───────────────────────────────────────────────┐
│                 Platform SDKs                 │
│    Flutter │ Swift │ Kotlin │ Unity │ CLI     │
├───────────────────────────────────────────────┤
│               xybrid-sdk (Rust)               │
│    Registry │ Cache │ Pipeline │ Telemetry    │
├───────────────────────────────────────────────┤
│              xybrid-core (Rust)               │
│  TemplateExecutor │ Envelope │ Preprocessing  │
├───────────────────────────────────────────────┤
│               Runtime Backends                │
│       ONNX Runtime │ Candle │ llama.cpp       │
└───────────────────────────────────────────────┘

Every platform SDK is a thin binding over the same Rust code. The business logic (model loading, preprocessing, execution, postprocessing) lives in xybrid-core and xybrid-sdk. It’s written once, tested once, and deployed everywhere.
Decision 1: The Envelope Pattern
The first problem: how do you pass data between pipeline stages when the data type varies? ASR takes audio bytes, TTS takes text, embeddings produce float vectors.
We created the Envelope — a tagged union that carries any payload through the system:
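use std::collections::HashMap;

// One container type flows through every pipeline stage.
pub struct Envelope {
    pub kind: EnvelopeKind,
    pub metadata: HashMap<String, String>, // side-channel info, e.g. sample rate
}

// The payload: exactly one variant per envelope.
pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}

Every pipeline stage accepts an Envelope and returns an Envelope. This makes stages composable without knowing about each other: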
Audio(wav_bytes) → [AudioDecode] → Tensor(samples) → [Model] → Tensor(logits) → [CTCDecode] → Text(transcript)

The metadata map carries side-channel info (sample rate, voice ID, message role) without polluting the type. It’s intentionally stringly-typed: pipeline stages are heterogeneous, and we don’t want a combinatorial explosion of types.
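To make the composition concrete, here is a minimal sketch of stages chained over an Envelope. It assumes a hypothetical Stage trait; ByteTokenize, TokensToTensor, and run_pipeline are illustrative names and toy stages, not Xybrid's actual API.

```rust
use std::collections::HashMap;

/// Payload variants, mirroring the EnvelopeKind enum from the article.
#[derive(Debug, Clone, PartialEq)]
pub enum EnvelopeKind {
    Audio(Vec<u8>),
    Text(String),
    Embedding(Vec<f32>),
    Tokens(Vec<i64>),
    Tensor { data: Vec<f32>, shape: Vec<usize> },
}

#[derive(Debug, Clone, PartialEq)]
pub struct Envelope {
    pub kind: EnvelopeKind,
    pub metadata: HashMap<String, String>,
}

/// Hypothetical stage interface: one envelope in, one envelope out.
pub trait Stage {
    fn run(&self, input: Envelope) -> Result<Envelope, String>;
}

/// Toy stage: Text -> Tokens, one token per UTF-8 byte.
pub struct ByteTokenize;

impl Stage for ByteTokenize {
    fn run(&self, input: Envelope) -> Result<Envelope, String> {
        match input.kind {
            EnvelopeKind::Text(s) => Ok(Envelope {
                kind: EnvelopeKind::Tokens(s.bytes().map(|b| b as i64).collect()),
                metadata: input.metadata, // metadata passes through untouched
            }),
            other => Err(format!("ByteTokenize expects Text, got {:?}", other)),
        }
    }
}

/// Toy stage: Tokens -> 1-D Tensor of the same length.
pub struct TokensToTensor;

impl Stage for TokensToTensor {
    fn run(&self, input: Envelope) -> Result<Envelope, String> {
        match input.kind {
            EnvelopeKind::Tokens(tokens) => {
                let shape = vec![tokens.len()];
                let data = tokens.into_iter().map(|t| t as f32).collect();
                Ok(Envelope {
                    kind: EnvelopeKind::Tensor { data, shape },
                    metadata: input.metadata,
                })
            }
            other => Err(format!("TokensToTensor expects Tokens, got {:?}", other)),
        }
    }
}

/// Thread one envelope through every stage, stopping at the first error.
pub fn run_pipeline(stages: &[Box<dyn Stage>], input: Envelope) -> Result<Envelope, String> {
    stages.iter().try_fold(input, |env, stage| stage.run(env))
}
```

Each stage checks only the variant it consumes, so a mismatched pipeline fails with an error at the offending stage rather than at compile time. That is the trade-off of a single tagged union over per-stage input and output types.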
Decision 2: model_metadata.json as the Contract
The hardest part of ML inference isn’t running the model. It’s everything around it: preprocessing inputs, configuring the session, and postprocessing outputs.
We solved this with a declarative metadata file that ships with every model:
{
  "model_id": "kokoro-82m",
  "execution_template": {
    "type": "SimpleMode",
    "model_file": "model.onnx"
  },
  "preprocessing": [
    { "type": "Phonemize", "backend": "MisakiDictionary", "tokens_file": "tokens.txt" }
  ],
  "postprocessing": [
    { "type": "TTSAudioEncode", "sample_rate": 24000 }
  ]
}

The TemplateExecutor reads this file and handles the full execution flow. No platform-specific inference code. The same metadata drives inference on iOS, Android, and your laptop.
This is the key architectural insight: the model knows how to run itself. The runtime just follows instructions.
let metadata: ModelMetadata = serde_json::from_str(&std::fs::read_to_string(path)?)?;
let mut executor = TemplateExecutor::with_base_path(model_dir);
let output = executor.execute(&metadata, &input)?;

Three lines. The same call works for TTS, ASR, classification, and embeddings: any model type.
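The idea that the runtime just follows the model's instructions can be sketched without any JSON parsing. The following is a toy illustration using only the standard library, not Xybrid's TemplateExecutor. Step, Metadata, and execute are hypothetical names, the two step kinds are invented for the example, and the model is passed in as a plain closure.

```rust
/// Hypothetical processing steps. A real metadata file would name richer
/// ones, such as Phonemize or TTSAudioEncode.
#[derive(Debug, Clone, PartialEq)]
pub enum Step {
    /// Multiply every value by a constant gain.
    Gain(f32),
    /// Clamp every value into [-1.0, 1.0].
    ClampUnit,
}

/// Hypothetical in-memory form of a model's declarative contract.
pub struct Metadata {
    pub model_id: String,
    pub preprocessing: Vec<Step>,
    pub postprocessing: Vec<Step>,
}

/// Interpret a single step. Supporting a new step kind means one new match arm.
fn apply(step: &Step, data: Vec<f32>) -> Vec<f32> {
    match *step {
        Step::Gain(g) => data.into_iter().map(|x| x * g).collect(),
        Step::ClampUnit => data.into_iter().map(|x| x.max(-1.0).min(1.0)).collect(),
    }
}

/// Run preprocessing, then the model, then postprocessing, in the order the
/// metadata lists them. The executor itself holds no model-specific logic.
pub fn execute<F>(meta: &Metadata, input: Vec<f32>, model: F) -> Vec<f32>
where
    F: Fn(Vec<f32>) -> Vec<f32>,
{
    let prepared = meta.preprocessing.iter().fold(input, |d, s| apply(s, d));
    let raw = model(prepared);
    meta.postprocessing.iter().fold(raw, |d, s| apply(s, d))
}
```

Changing a model's behaviour then means editing its metadata, not the executor. That property is what lets one file drive every platform.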
Decision 3: Three FFI Strategies
Rust is great for writing the core. But getting it into Dart, Swift, Kotlin, and C# requires FFI. We use three different approaches:
| Platform | FFI Tool | Why |
|---|---|---|
| Flutter | flutter_rust_bridge (FRB) | Auto-generates Dart bindings, handles async, supports streaming callbacks |
| Swift & Kotlin | UniFFI | Mozilla’s tool, generates idiomatic bindings from a single UDL definition |
| Unity (C#) | C FFI + cbindgen | Unity needs raw C headers; cbindgen generates them from Rust |
Why three? Because each ecosystem has different expectations:
- Flutter needs async Futures and Streams. FRB handles this natively.
- Swift expects async/await and value types. UniFFI maps Rust types to Swift idioms.
- Kotlin expects suspending functions and data classes. UniFFI handles this too.
- Unity needs DllImport with C calling conventions. Only C FFI works here.
The binding layer is thin by design. Here’s the entire Kotlin API for running a model:
val model = XybridModelLoader.fromRegistry("kokoro-82m")
val result = model.run(Envelope.text("Hello"))
// result.audioBytes() → play it

All the complexity is in Rust. The bindings are projections, not reimplementations.
Decision 4: Feature Flags for Platform Presets
Not every platform supports every backend. CoreML only works on Apple. Metal only on Apple GPUs. Dynamic ORT loading is needed on Android where we can’t statically link.
We use Cargo feature flags composed into platform presets:
[features]
platform-macos = ["ort-download", "ort-coreml", "candle-metal", "llm-llamacpp"]
platform-ios = ["ort-download", "ort-coreml", "candle-metal", "llm-llamacpp"]
platform-android = ["ort-dynamic", "candle", "llm-llamacpp"]
platform-desktop = ["ort-download", "llm-llamacpp"]

Invalid combinations are caught at compile time:
#[cfg(all(feature = "ort-download", feature = "ort-dynamic"))]
compile_error!("ort-download and ort-dynamic are mutually exclusive");

The build system (cargo xtask) auto-detects the target triple and applies the correct preset. Developers don’t think about feature flags; they just build for their target.
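As a rough sketch of what that auto-detection could look like, the function below maps a target triple to one of the presets listed above. The matching rules and the names preset_for_target and cargo_args are assumptions made for illustration, not Xybrid's actual xtask code.

```rust
/// Map a Rust target triple to a platform preset feature.
/// The matching rules here are assumptions, not Xybrid's real logic.
pub fn preset_for_target(triple: &str) -> Option<&'static str> {
    if triple.ends_with("-apple-darwin") {
        Some("platform-macos")
    } else if triple.contains("-apple-ios") {
        // Also covers simulator triples such as aarch64-apple-ios-sim.
        Some("platform-ios")
    } else if triple.contains("-linux-android") {
        // Checked before the generic Linux rule, since Android triples
        // also contain "-linux-".
        Some("platform-android")
    } else if triple.contains("-linux-") || triple.contains("-windows-") {
        Some("platform-desktop")
    } else {
        None
    }
}

/// Build the argument list an xtask could hand to `cargo`.
/// Returns None when no preset is known for the triple.
pub fn cargo_args(triple: &str) -> Option<Vec<String>> {
    preset_for_target(triple).map(|preset| {
        ["build", "--target", triple, "--features", preset]
            .iter()
            .map(|s| s.to_string())
            .collect()
    })
}
```

Returning None for an unknown triple lets the build tool fail with a clear message instead of silently building with the wrong backends.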
Decision 5: Always-Available Types
A subtle but important pattern: types that describe capabilities are available even when the capability is disabled.
// These compile without any feature flags:
pub struct GenerationConfig {
    pub max_tokens: usize,
    pub temperature: f32,
    pub top_p: f32,
}

pub struct ChatMessage {
    pub role: MessageRole,
    pub content: String,
}

Why? Because downstream code (Flutter bindings, SDK) needs to reference these types in function signatures even on platforms where LLM inference isn’t available. Without this, you’d need #[cfg] conditionals everywhere in the binding layer.
The types always exist. The implementations are gated behind features. Clean separation.
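One way to realize this split is sketched below: the type is compiled unconditionally, while the function body is chosen by a feature flag. The generate function, the LlmError type, and the stub fallback are assumptions made for illustration; the article only states that the implementations are feature-gated.

```rust
/// Always compiled, so bindings can name it in signatures on every platform.
#[derive(Debug, Clone, PartialEq)]
pub struct GenerationConfig {
    pub max_tokens: usize,
    pub temperature: f32,
    pub top_p: f32,
}

/// Hypothetical error type, also always available.
#[derive(Debug, PartialEq)]
pub enum LlmError {
    BackendDisabled,
}

/// Compiled only when the LLM backend feature is enabled.
/// The body is a placeholder standing in for a real backend call.
#[cfg(feature = "llm-llamacpp")]
pub fn generate(prompt: &str, config: &GenerationConfig) -> Result<String, LlmError> {
    Ok(prompt.chars().take(config.max_tokens).collect())
}

/// Fallback with the same signature when the feature is off, so callers
/// compile unchanged and get a runtime error instead of a build failure.
#[cfg(not(feature = "llm-llamacpp"))]
pub fn generate(_prompt: &str, _config: &GenerationConfig) -> Result<String, LlmError> {
    Err(LlmError::BackendDisabled)
}
```

Built without the feature, generate returns Err(LlmError::BackendDisabled). The binding layer needs no #[cfg] of its own because the signature is identical in both configurations.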
The Hard Parts
ONNX Runtime on iOS
ORT doesn’t ship a nice iOS package. We vendor a pre-built xcframework in vendor/ort-ios/ and share it across every build path (xtask, Flutter, SPM) via a single resolve_ort_lib_location() function.
A symlink at bindings/flutter/ios/Frameworks/onnxruntime.xcframework keeps CocoaPods happy. It’s ugly, but it works on every CI machine without manual setup.
Candle on Android
Candle uses the gemm-f16 crate, which requires ARM +fp16 instructions. We had to add target-specific rustflags:

# .cargo/config.toml
[target.aarch64-linux-android]
rustflags = ["-C", "target-feature=+fp16"]

Runtime dispatch still works on older devices without FP16. The flag only affects compilation.
Mutable Whisper Decoding
ONNX models use session.run() with &self. But Whisper’s autoregressive decoding via Candle needs &mut self — each token generation modifies internal state.
Our solution: TemplateExecutor detects Candle models and handles mutability internally, while the public API stays immutable. Callers don’t know or care.
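The article does not say how the mutability is hidden. One common technique is interior mutability, sketched here with std::sync::Mutex. ToyDecoder and Executor are hypothetical stand-ins, not Xybrid's types, and the "decoding" just counts upward.

```rust
use std::sync::Mutex;

/// Stand-in for a stateful autoregressive decoder.
pub struct ToyDecoder {
    history: Vec<u32>,
}

impl ToyDecoder {
    /// Needs &mut self: every step appends to the running history.
    fn next_token(&mut self) -> u32 {
        let next = self.history.last().map_or(0, |t| t + 1);
        self.history.push(next);
        next
    }

    /// Clear state between independent decode calls.
    fn reset(&mut self) {
        self.history.clear();
    }
}

/// The public API takes &self; the Mutex provides the &mut access internally.
pub struct Executor {
    decoder: Mutex<ToyDecoder>,
}

impl Executor {
    pub fn new() -> Self {
        Executor {
            decoder: Mutex::new(ToyDecoder { history: Vec::new() }),
        }
    }

    /// Callers see an immutable method even though decoding mutates state.
    pub fn decode(&self, steps: usize) -> Vec<u32> {
        let mut decoder = self.decoder.lock().expect("decoder mutex poisoned");
        decoder.reset();
        (0..steps).map(|_| decoder.next_token()).collect()
    }
}
```

A Mutex keeps the executor shareable across threads at the cost of serializing decode calls. In single-threaded code, RefCell would give the same &self surface without locking.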
What We’d Do Differently
Start with UniFFI for all native bindings. We built the C FFI layer first, then realized UniFFI gives us Swift + Kotlin for free. The C layer is only still needed for Unity.
Define the API contract earlier. We now have an api-surface.yaml that defines the public API across all SDKs. Adding it from day one would have prevented drift between platforms.
Vendor fewer things. ORT iOS vendoring was necessary, but every vendored dependency is a maintenance burden. We’d push harder for upstream packages.
Results
From a single Rust codebase:
- 6 platforms shipping (CLI, Flutter, Swift, Kotlin, Unity, native Rust)
- 3 inference backends (ONNX Runtime, Candle, llama.cpp)
- 58 unit tests + 7 doctests in core, all passing
- One model_metadata.json per model, works everywhere
The Rust core is ~15K lines. Each binding layer is under 1K lines. That’s the power of putting the logic in one place.
Xybrid is open-source: github.com/xybrid-ai/xybrid
If you’re building cross-platform AI features and tired of maintaining separate inference code per platform — check it out. We’d love your feedback.
Have questions about the architecture? Drop them in the comments — happy to go deeper on any of these decisions.