You want speech-to-text in a compiled binary — a CLI tool, a backend service, an embedded system. No Python runtime. No cloud API. Just Rust, Whisper, and audio bytes in → text out.
We’ll use Xybrid with Whisper running via Candle — Hugging Face’s pure-Rust ML framework. The entire inference stack compiles to a single native binary with zero runtime dependencies.
Who This Is For
- CLI transcription tools — batch process audio files on a server
- Embedded systems — Raspberry Pi, edge devices, kiosks
- Desktop apps — native macOS/Linux/Windows
- Backend services — transcribe uploads without paying cloud API fees
- Privacy-critical systems — medical, legal, financial where data must stay on-premise
For these, you want a compiled binary with no runtime. Rust gives you that. Candle gives you pure-Rust ML inference. No shared libraries to manage.
Quick Start: CLI
If you just want transcription without writing code:
cargo install xybrid-cli
xybrid run --model whisper-tiny --input recording.wav
# Output: "Hello, this is a test of on-device transcription."
That’s it. But let’s build it from scratch to understand what’s happening underneath.
Step 1: Set Up the Project
cargo new speech-to-text
cd speech-to-text
# Cargo.toml
[dependencies]
xybrid-core = { version = "0.1.0", features = ["candle"] }
xybrid-sdk = "0.1.0"
anyhow = "1.0"
serde_json = "1.0" # used in Step 3 to parse model_metadata.json
The candle feature enables pure-Rust inference via Candle. No system dependencies to install.
Step 2: Download and Cache the Model
use xybrid_sdk::RegistryClient;
use std::path::PathBuf;
fn get_model() -> anyhow::Result<PathBuf> {
    let client = RegistryClient::default_client()?;
    // Reuse the cached copy if the model was downloaded on an earlier run.
    if let Some(path) = client.cached_path("whisper-tiny")? {
        return Ok(path);
    }
    println!("Downloading Whisper Tiny (~75MB)...");
    let path = client.fetch("whisper-tiny", None, |p| {
        // "\r" returns to the start of the line so the percentage updates in place.
        print!("\r{:.0}%", p * 100.0);
    })?;
    println!("\nCached at: {}", path.display());
    Ok(path)
}
First run downloads to ~/.xybrid/cache/whisper-tiny/. Every subsequent run loads from disk. No network.
Step 3: Transcribe
use xybrid_core::execution::{ModelMetadata, TemplateExecutor};
use xybrid_core::ir::{Envelope, EnvelopeKind};
fn transcribe(model_dir: &PathBuf, audio_path: &str) -> anyhow::Result<String> {
    // The metadata file describes the model and its pre/post-processing.
    let metadata: ModelMetadata = serde_json::from_str(
        &std::fs::read_to_string(model_dir.join("model_metadata.json"))?
    )?;
    // Pass the raw file bytes; decoding happens inside the executor.
    let audio_bytes = std::fs::read(audio_path)?;
    let input = Envelope {
        kind: EnvelopeKind::Audio(audio_bytes),
        metadata: std::collections::HashMap::new(),
    };
    let mut executor = TemplateExecutor::with_base_path(model_dir.to_str().unwrap());
    let output = executor.execute(&metadata, &input)?;
    match output.kind {
        EnvelopeKind::Text(text) => Ok(text),
        _ => Err(anyhow::anyhow!("Expected text output")),
    }
}
The TemplateExecutor reads model_metadata.json and handles everything: WAV decoding → mel spectrogram → Whisper encoder-decoder → token decoding → text. You don’t configure any of it.
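The pipeline above starts with WAV decoding. Whisper models operate on 16 kHz mono audio, and this post doesn't say whether the executor resamples other formats, so a cautious option is to inspect the file header before handing the bytes over. The sketch below uses only the standard library; `WavFormat`, `read_wav_format`, and `is_whisper_ready` are hypothetical names for illustration, not part of Xybrid.

```rust
use std::convert::TryInto;

/// Format fields from a WAV file's `fmt ` chunk (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct WavFormat {
    channels: u16,
    sample_rate: u32,
    bits_per_sample: u16,
}

/// Scans the RIFF chunks for `fmt ` and returns its format fields.
/// Returns None if the bytes are not a well-formed RIFF/WAVE file.
fn read_wav_format(bytes: &[u8]) -> Option<WavFormat> {
    if bytes.len() < 12 || &bytes[0..4] != b"RIFF" || &bytes[8..12] != b"WAVE" {
        return None;
    }
    let mut pos = 12;
    while pos + 8 <= bytes.len() {
        let id = &bytes[pos..pos + 4];
        let size = u32::from_le_bytes(bytes[pos + 4..pos + 8].try_into().ok()?) as usize;
        let body = pos + 8;
        if id == b"fmt " {
            // The basic PCM format block is 16 bytes long.
            if size < 16 || body + 16 > bytes.len() {
                return None;
            }
            return Some(WavFormat {
                channels: u16::from_le_bytes(bytes[body + 2..body + 4].try_into().ok()?),
                sample_rate: u32::from_le_bytes(bytes[body + 4..body + 8].try_into().ok()?),
                bits_per_sample: u16::from_le_bytes(bytes[body + 14..body + 16].try_into().ok()?),
            });
        }
        // Skip to the next chunk; chunks are padded to an even number of bytes.
        pos = body + size + (size & 1);
    }
    None
}

/// True when the audio is already in the 16 kHz mono layout Whisper expects.
fn is_whisper_ready(format: &WavFormat) -> bool {
    format.channels == 1 && format.sample_rate == 16_000
}
```

Calling `read_wav_format(&audio_bytes)` before building the `Envelope` lets a tool report an unexpected format up front instead of producing a poor transcript.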
Step 4: Wire It Up
fn main() -> anyhow::Result<()> {
    let args: Vec<String> = std::env::args().collect();
    // The first argument after the program name is the audio file.
    let audio_path = args.get(1)
        .ok_or_else(|| anyhow::anyhow!("Usage: speech-to-text <audio.wav>"))?;
    let model_dir = get_model()?;
    let text = transcribe(&model_dir, audio_path)?;
    println!("{}", text);
    Ok(())
}
cargo run -- recording.wav
# Hello, this is a test of on-device transcription.
A single binary. No Python. No containers. Deploy it anywhere Rust compiles.
Why Candle Instead of ONNX Runtime
Whisper uses autoregressive decoding — each output token depends on previous tokens. This requires mutable state between inference steps.
ONNX Runtime’s session.run() is stateless (&self). You’d need to manage decoder state externally, which is complex and error-prone.
Candle handles this natively. Xybrid detects Whisper models and automatically routes them through Candle:
model_metadata.json: "type": "WhisperMode"
→ TemplateExecutor detects Candle backend
→ CandleRuntimeAdapter with &mut self
→ Autoregressive decoding handled internally
You don’t choose the backend. The metadata file determines it.
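To make the mutable-state point concrete, here is a toy greedy decoding loop. It is a sketch only: `ToyDecoder` is a hypothetical stand-in that fakes its logits, and nothing here reflects Candle's or Xybrid's actual API. What it does show is the shape of the problem: each step both reads and updates cached state, so the step function naturally takes `&mut self`.

```rust
/// A toy stand-in for a decoder with cached state (hypothetical, not a real model).
struct ToyDecoder {
    /// Tokens seen so far; a real model would cache attention keys and values here.
    history: Vec<u32>,
}

impl ToyDecoder {
    /// One decoding step. It needs `&mut self` because every call appends
    /// to the cached history that later steps depend on.
    fn step(&mut self, token: u32) -> Vec<f32> {
        self.history.push(token);
        // Fake logits over a 4-token vocabulary: favor (sum of history) mod 4.
        let favored = (self.history.iter().sum::<u32>() % 4) as usize;
        let mut logits = vec![0.0; 4];
        logits[favored] = 1.0;
        logits
    }
}

/// Greedy decoding: feed each chosen token back in until EOS or a length cap.
fn greedy_decode(decoder: &mut ToyDecoder, start: u32, eos: u32, max_len: usize) -> Vec<u32> {
    let mut output = Vec::new();
    let mut token = start;
    for _ in 0..max_len {
        let logits = decoder.step(token);
        // Pick the index of the highest logit (argmax).
        token = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(index, _)| index as u32)
            .unwrap();
        if token == eos {
            break;
        }
        output.push(token);
    }
    output
}
```

With a stateless `&self` step, the caller would have to pass the history (or a key/value cache) in and out on every iteration, which is the external bookkeeping described above.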
Batch Processing
For server-side use cases — transcribing uploads, processing audio archives:
fn transcribe_batch(model_dir: &PathBuf, files: &[&str]) -> anyhow::Result<()> {
    let metadata: ModelMetadata = serde_json::from_str(
        &std::fs::read_to_string(model_dir.join("model_metadata.json"))?
    )?;
    // Create the executor once, outside the loop, so it is shared by every file.
    let mut executor = TemplateExecutor::with_base_path(model_dir.to_str().unwrap());
    for file in files {
        let audio_bytes = std::fs::read(file)?;
        let input = Envelope {
            kind: EnvelopeKind::Audio(audio_bytes),
            metadata: std::collections::HashMap::new(),
        };
        let output = executor.execute(&metadata, &input)?;
        if let EnvelopeKind::Text(text) = &output.kind {
            println!("{}: {}", file, text);
        }
    }
    Ok(())
}
The executor reuses loaded model weights across calls. First inference is slower (model loads into memory). Subsequent calls reuse everything.
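For an audio archive, the file list usually comes from a directory rather than from hand-typed arguments. As a minimal sketch using only the standard library, the hypothetical helper `collect_wav_files` below gathers the `.wav` files directly inside one directory (it does not recurse) and sorts them so the output order is stable.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns the `.wav` files directly inside `dir`, sorted by path.
/// The extension check is case-insensitive, so `clip.WAV` also matches.
fn collect_wav_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_wav = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("wav"));
        if path.is_file() && is_wav {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}
```

Since `transcribe_batch` takes `&[&str]`, the returned paths would need converting with `to_str()` before the call, skipping or reporting any that are not valid UTF-8.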
Performance
Measured transcribing a 10-second audio clip:
| Device | whisper-tiny | whisper-base |
|---|---|---|
| MacBook Pro M2 | 0.8s | 2.1s |
| Desktop (i7-12700) | 1.2s | 3.5s |
| Raspberry Pi 4 | 8s | 22s |
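On the laptop and desktop above, Whisper Tiny transcribes the 10-second clip roughly 8 to 12 times faster than real time. A Raspberry Pi 4 still finishes Whisper Tiny in less than the clip's duration (8s), but Whisper Base (22s) takes more than twice as long as the audio itself, so the Pi is best suited to non-interactive use cases.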
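A common way to compare these numbers is the real-time factor: processing time divided by audio duration, where anything below 1.0 is faster than real time. The small sketch below applies that arithmetic to the table above; the function names are illustrative only.

```rust
/// Real-time factor: processing time divided by audio duration.
/// Values below 1.0 mean the clip is transcribed faster than it plays.
fn real_time_factor(processing_secs: f64, audio_secs: f64) -> f64 {
    processing_secs / audio_secs
}

/// True when transcription finishes before the audio would finish playing.
fn is_faster_than_real_time(processing_secs: f64, audio_secs: f64) -> bool {
    real_time_factor(processing_secs, audio_secs) < 1.0
}
```

For the 10-second clip, the MacBook Pro M2 running whisper-tiny gives 0.8 / 10 = 0.08, while the Raspberry Pi 4 gives 0.8 for whisper-tiny and 2.2 for whisper-base.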
The Privacy Architecture
Audio file → Rust binary → Text output
                 │
                 └── Model loaded from ~/.xybrid/cache/
                     (downloaded once, never phones home)
What doesn’t happen:
- No audio sent to any server
- No telemetry about what you transcribe
- No API keys that could be revoked
- No model updates that change behavior without your knowledge
The model is a file on your disk. Inference is a function call. The binary works airgapped.
Available Whisper Models
| Model | Parameters | Size | Speed | Accuracy |
|---|---|---|---|---|
| whisper-tiny | 39M | ~75MB | Real-time | Good for clear speech |
| whisper-base | 74M | ~150MB | Near real-time | Better accuracy |
| whisper-small | 244M | ~500MB | Slower | Much better accuracy |
Start with whisper-tiny. Upgrade if accuracy matters more than speed for your use case.
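If the deciding constraint is disk or memory footprint, the choice can be reduced to a lookup over the table above. This is a sketch under that assumption: the sizes are the approximate figures from the table, and `largest_model_within` is a hypothetical helper, not something the Xybrid registry provides.

```rust
/// (model name, parameters in millions, approximate size in MB), from the table above.
const MODELS: [(&str, u32, u32); 3] = [
    ("whisper-tiny", 39, 75),
    ("whisper-base", 74, 150),
    ("whisper-small", 244, 500),
];

/// Picks the largest listed model whose approximate size fits the budget,
/// or None if even the smallest model is too big.
fn largest_model_within(budget_mb: u32) -> Option<&'static str> {
    MODELS
        .iter()
        .filter(|(_, _, size_mb)| *size_mb <= budget_mb)
        .max_by_key(|(_, _, size_mb)| *size_mb)
        .map(|(name, _, _)| *name)
}
```

A 200 MB budget selects whisper-base, and anything under 75 MB selects nothing, which is a signal to relax the budget rather than a reason to guess.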
Get started:
cargo install xybrid-cli
xybrid run --model whisper-tiny --input your-audio.wav
GitHub: github.com/xybrid-ai/xybrid
Building privacy-first voice features in Rust? Share your use case in the comments.
Related
- Building a Voice Agent That Runs Entirely On-Device — full pipeline tutorial: ASR + LLM + TTS in Flutter.
- Private AI: How to Run AI Models Without Sending Data to the Cloud — the compliance and privacy case for local inference.
- On-Device AI: The Complete Guide — the broader landscape of on-device ML inference.