
Whisper Speech Recognition as a Single Rust Binary

Build a Whisper-powered transcription tool as a single Rust binary. Candle for pure-Rust inference. No runtime dependencies, works airgapped.

Glenn Sonna

· June 30, 2026 · 5 min read

whisper-rust · offline-ai · speech-recognition · local-ai · private-ai

You want speech-to-text in a compiled binary — a CLI tool, a backend service, an embedded system. No Python runtime. No cloud API. Just Rust, Whisper, and audio bytes in → text out.

We’ll use Xybrid with Whisper running via Candle — Hugging Face’s pure-Rust ML framework. The entire inference stack compiles to a single native binary with zero runtime dependencies.

Who This Is For

CLI transcription tools — batch process audio files on a server
Embedded systems — Raspberry Pi, edge devices, kiosks
Desktop apps — native macOS/Linux/Windows
Backend services — transcribe uploads without paying cloud API fees
Privacy-critical systems — medical, legal, financial where data must stay on-premise

For these, you want a compiled binary with no runtime. Rust gives you that. Candle gives you pure-Rust ML inference. No shared libraries to manage.

Quick Start: CLI

If you just want transcription without writing code:

cargo install xybrid-cli
xybrid run --model whisper-tiny --input recording.wav
# Output: "Hello, this is a test of on-device transcription."

That’s it. But let’s build it from scratch to understand what’s happening underneath.

Step 1: Set Up the Project

cargo new speech-to-text
cd speech-to-text

# Cargo.toml
[dependencies]
xybrid-core = { version = "0.1.0", features = ["candle"] }
xybrid-sdk = "0.1.0"
anyhow = "1.0"
serde_json = "1.0"  # used in Step 3 to parse model_metadata.json

The candle feature enables pure-Rust inference via Candle. No system dependencies to install.

Step 2: Download and Cache the Model

use xybrid_sdk::RegistryClient;
use std::path::PathBuf;

fn get_model() -> anyhow::Result<PathBuf> {
    let client = RegistryClient::default_client()?;

    if let Some(path) = client.cached_path("whisper-tiny")? {
        return Ok(path);
    }

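    println!("Downloading Whisper Tiny (~75MB)...");
    let path = client.fetch("whisper-tiny", None, |p| {
        // `\r` returns to the start of the line so the percentage updates in place.
        print!("\r  {:.0}%", p * 100.0);
        // stdout is line-buffered, so flush to show progress before the next newline.
        let _ = std::io::Write::flush(&mut std::io::stdout());
    })?;
    println!("\nCached at: {}", path.display());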

    Ok(path)
}

First run downloads to ~/.xybrid/cache/whisper-tiny/. Every subsequent run loads from disk. No network.
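For an airgapped deployment it can help to confirm the model is already on disk before the machine goes offline. The sketch below uses only the standard library and the cache layout described above. The helper names `model_cache_dir` and `is_model_cached` are hypothetical, and treating the presence of `model_metadata.json` as "fully cached" is an assumption, not something the Xybrid SDK documents here.

```rust
use std::path::{Path, PathBuf};

/// Directory where a model is expected under a home directory,
/// e.g. `<home>/.xybrid/cache/whisper-tiny`.
pub fn model_cache_dir(home: &Path, model: &str) -> PathBuf {
    home.join(".xybrid").join("cache").join(model)
}

/// Assumption: a cached model is usable once its `model_metadata.json` exists.
pub fn is_model_cached(home: &Path, model: &str) -> bool {
    model_cache_dir(home, model)
        .join("model_metadata.json")
        .is_file()
}
```

Calling `is_model_cached` with the user's home directory and `"whisper-tiny"` in a deployment script gives a quick preflight check without touching the network.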

Step 3: Transcribe

use std::path::Path;

use xybrid_core::execution::{ModelMetadata, TemplateExecutor};
use xybrid_core::ir::{Envelope, EnvelopeKind};

fn transcribe(model_dir: &Path, audio_path: &str) -> anyhow::Result<String> {
    // The metadata file tells the executor how to run this model.
    let metadata: ModelMetadata = serde_json::from_str(
        &std::fs::read_to_string(model_dir.join("model_metadata.json"))?
    )?;

    // Pass the raw file bytes; WAV decoding happens inside the executor.
    let audio_bytes = std::fs::read(audio_path)?;
    let input = Envelope {
        kind: EnvelopeKind::Audio(audio_bytes),
        metadata: std::collections::HashMap::new(),
    };

    // Return an error instead of panicking on a non-UTF-8 model path.
    let base_path = model_dir.to_str()
        .ok_or_else(|| anyhow::anyhow!("Model path is not valid UTF-8"))?;
    let mut executor = TemplateExecutor::with_base_path(base_path);
    let output = executor.execute(&metadata, &input)?;

    match output.kind {
        EnvelopeKind::Text(text) => Ok(text),
        _ => Err(anyhow::anyhow!("Expected text output")),
    }
}

The TemplateExecutor reads model_metadata.json and handles everything: WAV decoding → mel spectrogram → Whisper encoder-decoder → token decoding → text. You don’t configure any of it.
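To make the first stage of that pipeline less abstract, here is a minimal standard-library sketch of WAV decoding: it walks the RIFF chunks, reads the `fmt ` chunk, and scales 16-bit samples to floats. This is not Xybrid's implementation. `Pcm` and `decode_pcm16_wav` are hypothetical names, and the sketch assumes uncompressed 16-bit integer PCM only.

```rust
/// Decoded audio: sample rate, channel count, and samples scaled to [-1.0, 1.0).
#[derive(Debug, PartialEq)]
pub struct Pcm {
    pub sample_rate: u32,
    pub channels: u16,
    pub samples: Vec<f32>,
}

/// Hypothetical helper: parse a RIFF/WAVE buffer holding 16-bit integer PCM.
pub fn decode_pcm16_wav(bytes: &[u8]) -> Result<Pcm, String> {
    if bytes.len() < 12 || &bytes[0..4] != b"RIFF" || &bytes[8..12] != b"WAVE" {
        return Err("not a RIFF/WAVE file".to_string());
    }
    let mut pos = 12;
    let mut format: Option<(u16, u32)> = None; // (channels, sample_rate)

    // Each chunk is a 4-byte id, a little-endian u32 size, then the body.
    while pos + 8 <= bytes.len() {
        let id = &bytes[pos..pos + 4];
        let size = u32::from_le_bytes([
            bytes[pos + 4], bytes[pos + 5], bytes[pos + 6], bytes[pos + 7],
        ]) as usize;
        let start = pos + 8;
        let end = start
            .checked_add(size)
            .filter(|&e| e <= bytes.len())
            .ok_or("truncated chunk")?;
        let body = &bytes[start..end];

        if id == b"fmt " {
            if body.len() < 16 {
                return Err("fmt chunk too short".to_string());
            }
            let audio_format = u16::from_le_bytes([body[0], body[1]]);
            let channels = u16::from_le_bytes([body[2], body[3]]);
            let sample_rate = u32::from_le_bytes([body[4], body[5], body[6], body[7]]);
            let bits = u16::from_le_bytes([body[14], body[15]]);
            if audio_format != 1 || bits != 16 {
                return Err("only 16-bit integer PCM is supported".to_string());
            }
            format = Some((channels, sample_rate));
        } else if id == b"data" {
            let (channels, sample_rate) = format.ok_or("data chunk before fmt chunk")?;
            // Scale each i16 sample into the [-1.0, 1.0) range.
            let samples = body
                .chunks_exact(2)
                .map(|b| i16::from_le_bytes([b[0], b[1]]) as f32 / 32768.0)
                .collect();
            return Ok(Pcm { sample_rate, channels, samples });
        }
        // Chunk bodies are padded to an even number of bytes.
        pos = end + (size & 1);
    }
    Err("no data chunk found".to_string())
}
```

Whisper models operate on 16 kHz mono audio, so a real pipeline would follow this step with resampling and a channel downmix before computing the mel spectrogram.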

Step 4: Wire It Up

fn main() -> anyhow::Result<()> {
    let args: Vec<String> = std::env::args().collect();
    let audio_path = args.get(1)
        .ok_or_else(|| anyhow::anyhow!("Usage: speech-to-text <audio.wav>"))?;

    let model_dir = get_model()?;
    let text = transcribe(&model_dir, audio_path)?;
    println!("{}", text);

    Ok(())
}

cargo run -- recording.wav
# Hello, this is a test of on-device transcription.

A single binary. No Python. No containers. Deploy it anywhere Rust compiles.

Why Candle Instead of ONNX Runtime

Whisper uses autoregressive decoding — each output token depends on previous tokens. This requires mutable state between inference steps.

ONNX Runtime’s session.run() is stateless (&self). You’d need to manage decoder state externally, which is complex and error-prone.

Candle handles this natively. Xybrid detects Whisper models and automatically routes them through Candle:

model_metadata.json: "type": "WhisperMode"
  → TemplateExecutor detects Candle backend
  → CandleRuntimeAdapter with &mut self
  → Autoregressive decoding handled internally

You don’t choose the backend. The metadata file determines it.
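The reason `&mut self` matters is easier to see in a toy loop. The sketch below is not Candle's or Xybrid's API: `ToyDecoder` is a hypothetical stand-in whose "model" is simple arithmetic. It only illustrates the shape of greedy autoregressive decoding, where each step reads and extends state left behind by earlier steps.

```rust
/// Toy stand-in for a decoder that keeps state between steps.
pub struct ToyDecoder {
    /// Tokens seen so far; a real model would cache attention keys and values.
    cache: Vec<u32>,
}

impl ToyDecoder {
    /// End-of-transcript marker for this toy vocabulary.
    pub const EOT: u32 = 0;

    pub fn new() -> Self {
        Self { cache: Vec::new() }
    }

    /// One decoding step. Takes `&mut self` because every call grows the cache.
    pub fn step(&mut self, token: u32) -> u32 {
        self.cache.push(token);
        // Fake "model": the next token is the sum of the history modulo 4.
        let sum: u32 = self.cache.iter().sum();
        sum % 4
    }

    pub fn cache_len(&self) -> usize {
        self.cache.len()
    }
}

/// Greedy loop: feed each predicted token back in until EOT or the length cap.
pub fn greedy_decode(decoder: &mut ToyDecoder, start: u32, max_tokens: usize) -> Vec<u32> {
    let mut out = Vec::new();
    let mut token = start;
    for _ in 0..max_tokens {
        token = decoder.step(token);
        if token == ToyDecoder::EOT {
            break;
        }
        out.push(token);
    }
    out
}
```

With a stateless `run(&self)` interface, the caller would have to pass the cache in and get it back on every step. Keeping it inside the decoder behind `&mut self` lets the loop stay this small.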

Batch Processing

For server-side use cases — transcribing uploads, processing audio archives:

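fn transcribe_batch(model_dir: &std::path::Path, files: &[&str]) -> anyhow::Result<()> {
    let metadata: ModelMetadata = serde_json::from_str(
        &std::fs::read_to_string(model_dir.join("model_metadata.json"))?
    )?;
    // Return an error instead of panicking on a non-UTF-8 model path.
    let base_path = model_dir.to_str()
        .ok_or_else(|| anyhow::anyhow!("Model path is not valid UTF-8"))?;
    // Create the executor once so every file shares the loaded model.
    let mut executor = TemplateExecutor::with_base_path(base_path);

    for file in files {
        let audio_bytes = std::fs::read(file)?;
        let input = Envelope {
            kind: EnvelopeKind::Audio(audio_bytes),
            metadata: std::collections::HashMap::new(),
        };

        let output = executor.execute(&metadata, &input)?;
        match &output.kind {
            EnvelopeKind::Text(text) => println!("{}: {}", file, text),
            // Report unexpected output instead of skipping the file silently.
            _ => eprintln!("{}: expected text output", file),
        }
    }

    Ok(())
}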

The executor reuses loaded model weights across calls. First inference is slower (model loads into memory). Subsequent calls reuse everything.
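For an audio archive, the file list usually comes from a directory rather than from hand-typed arguments. This is a small standard-library sketch of that step. `collect_wav_files` is a hypothetical helper, and it assumes a flat directory with no recursion into subfolders.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect the `.wav` files directly inside `dir`, sorted for a stable order.
pub fn collect_wav_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // Match the extension case-insensitively so `.WAV` is picked up too.
        let is_wav = path
            .extension()
            .and_then(|ext| ext.to_str())
            .is_some_and(|ext| ext.eq_ignore_ascii_case("wav"));
        if is_wav && path.is_file() {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}
```

Since `transcribe_batch` takes `&[&str]`, the resulting paths would need converting with `to_str()` before being passed in.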

Performance

Measured transcribing a 10-second audio clip:

Device	whisper-tiny	whisper-base
MacBook Pro M2	0.8s	2.1s
Desktop (i7-12700)	1.2s	3.5s
Raspberry Pi 4	8s	22s

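Whisper Tiny runs well ahead of real time on a modern laptop or desktop: about one second of compute for ten seconds of audio. A Raspberry Pi 4 needs 8 seconds for the same clip with whisper-tiny, which still keeps pace with the audio, but 22 seconds with whisper-base. That is slower than real time, yet viable for non-interactive use cases.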
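A common way to compare these numbers is the real-time factor (RTF): processing time divided by audio duration, where anything below 1.0 is faster than real time. The sketch below applies that arithmetic to the timings in the table. The function and constant names are illustrative only.

```rust
/// Real-time factor: seconds of compute per second of audio.
/// Values below 1.0 mean transcription finishes before the clip would finish playing.
pub fn real_time_factor(processing_secs: f64, audio_secs: f64) -> f64 {
    processing_secs / audio_secs
}

/// (device, whisper-tiny seconds, whisper-base seconds) for a 10-second clip,
/// copied from the table above.
pub const MEASUREMENTS: [(&str, f64, f64); 3] = [
    ("MacBook Pro M2", 0.8, 2.1),
    ("Desktop (i7-12700)", 1.2, 3.5),
    ("Raspberry Pi 4", 8.0, 22.0),
];

/// One formatted line per device with the RTF for each model.
pub fn report(clip_secs: f64) -> Vec<String> {
    MEASUREMENTS
        .iter()
        .map(|(device, tiny, base)| {
            format!(
                "{device}: tiny RTF {:.2}, base RTF {:.2}",
                real_time_factor(*tiny, clip_secs),
                real_time_factor(*base, clip_secs)
            )
        })
        .collect()
}
```

For the 10-second clip this gives 0.08 and 0.21 on the M2, 0.12 and 0.35 on the i7, and 0.80 and 2.20 on the Raspberry Pi 4, so only whisper-base on the Pi falls behind real time.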

The Privacy Architecture

Audio file → Rust binary → Text output
                │
                └── Model loaded from ~/.xybrid/cache/
                    (downloaded once, never phones home)

What doesn’t happen:

No audio sent to any server
No telemetry about what you transcribe
No API keys that could be revoked
No model updates that change behavior without your knowledge

The model is a file on your disk. Inference is a function call. The binary works airgapped.

Available Whisper Models

Model	Parameters	Size	Speed	Accuracy
whisper-tiny	39M	~75MB	Real-time	Good for clear speech
whisper-base	74M	~150MB	Near real-time	Better accuracy
whisper-small	244M	~500MB	Slower	Much better accuracy

Start with whisper-tiny. Upgrade if accuracy matters more than speed for your use case.

Get started:

cargo install xybrid-cli
xybrid run --model whisper-tiny --input your-audio.wav

GitHub: github.com/xybrid-ai/xybrid

Building privacy-first voice features in Rust? Share your use case in the comments.

Building a Voice Agent That Runs Entirely On-Device — full pipeline tutorial: ASR + LLM + TTS in Flutter.
Private AI: How to Run AI Models Without Sending Data to the Cloud — the compliance and privacy case for local inference.
On-Device AI: The Complete Guide — the broader landscape of on-device ML inference.
