“Just enable CoreML, it’ll be faster.”
We heard this advice constantly. So we benchmarked it properly. The results surprised us: 6.8x faster for some models, actually slower for others.
Here’s what we learned optimizing ONNX Runtime inference on Apple Silicon, and the rules of thumb that came out of it.
The Setup
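We built Xybrid, an on-device ML inference SDK in Rust. One of its backends is ONNX Runtime, which supports multiple execution providers (EPs). The acceleration options we care about on Apple hardware:
- CPU: the default ONNX Runtime EP, works everywhere
- CoreML: ONNX Runtime’s EP for Apple’s ML framework, can target the Neural Engine (ANE)
- Metal: GPU compute via Apple’s Metal API. In Xybrid this comes from the Candle and llama.cpp backends (the candle-metal and llm-llamacpp features), not from an ONNX Runtime EP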
We wanted to know: for which models does CoreML actually help?
The Benchmark
We used Criterion.rs with three representative models:
| Model | Type | Input Shape | Size |
|---|---|---|---|
| MNIST | Vision (tiny) | [1, 1, 28, 28] | 26 KB |
| MobileNet v2 | Vision (CNN) | [1, 3, 224, 224] | 14 MB |
| Kokoro 82M | TTS (transformer) | dynamic | 330 MB |
# CPU baseline
cargo bench -p xybrid-core
# CoreML (routes eligible ops to ANE)
cargo bench -p xybrid-core --features coreml-ep
The Results
Tested on an M2 MacBook Pro:
| Model | CPU | CoreML (ANE) | Speedup |
|---|---|---|---|
| MNIST | 0.15ms | 0.8ms | 0.2x (slower) |
| MobileNet v2 | 12.4ms | 1.8ms | 6.8x faster |
| Kokoro 82M | 280ms | 290ms | ~1x (no benefit) |
Wait — MNIST got slower with CoreML? And Kokoro saw no improvement?
Why CoreML Isn’t Always Faster
Rule 1: Tiny Models Lose to Dispatch Overhead
The ANE is a dedicated hardware accelerator. Sending work to it has a fixed per-call overhead: the CoreML framework has to schedule the graph on the ANE and marshal data between CPU and ANE buffers. (Compiling the model graph is a separate, one-time cost; see Rule 4.)
For MNIST (26 KB, runs in 0.15ms on CPU), that dispatch overhead costs more than the inference itself. The model is too small to benefit.
Guideline: Models under ~1MB with inference times under 1ms won’t benefit from CoreML. Stick with CPU.
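The break-even logic behind this rule fits in a few lines. The sketch below is illustrative only: `speedup` and `offload_pays_off` are hypothetical helper names, not part of Xybrid or ONNX Runtime, and the split between dispatch overhead and accelerator compute time is an assumption, since the benchmark only reports the total.

```rust
/// Speedup of the accelerated path relative to CPU.
/// Values above 1.0 mean the accelerated path is faster.
pub fn speedup(cpu_ms: f64, accelerated_ms: f64) -> f64 {
    cpu_ms / accelerated_ms
}

/// Offloading pays off only when the CPU time exceeds the fixed dispatch
/// overhead plus the time the accelerator spends on the actual compute.
/// (Hypothetical model: the real overhead is not broken out in the benchmark.)
pub fn offload_pays_off(cpu_ms: f64, dispatch_overhead_ms: f64, accel_compute_ms: f64) -> bool {
    cpu_ms > dispatch_overhead_ms + accel_compute_ms
}
```

With the numbers from the results table, `speedup(0.15, 0.8)` comes out below 1 for MNIST, while `speedup(12.4, 1.8)` is well above it for MobileNet v2. Even if the accelerator's compute time were zero, any dispatch overhead above 0.15 ms would already make MNIST slower than the CPU path.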
Rule 2: CNNs with Fixed Shapes Are the Sweet Spot
MobileNet is the perfect ANE workload:
- Fixed input shape [1, 3, 224, 224] — the CoreML compiler can optimize the graph once
- Regular operations — convolutions, batch norms, ReLUs map directly to ANE instructions
- Sufficient compute — enough work to amortize dispatch overhead
The 6.8x speedup is real and consistent. Vision models (image classification, object detection, segmentation) with static shapes will almost always benefit.
Guideline: Fixed-shape CNN models → enable CoreML. Expect a speedup in the 5-10x range; we measured 6.8x on MobileNet v2.
Rule 3: Dynamic Shapes Kill ANE Performance
Kokoro TTS is a transformer with dynamic input shapes — the token sequence length varies per input. Each new shape triggers a recompilation of the CoreML graph.
Even when the graph is cached, dynamic shapes prevent the ANE from using its most aggressive optimizations. The result: inference time roughly matches CPU.
Guideline: Models with dynamic input dimensions (most transformers, seq2seq, attention-based models) — CoreML won’t help unless you pad to fixed sizes.
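Padding to fixed sizes usually means picking a small set of allowed sequence lengths ("buckets") and padding each input up to the nearest one, so the runtime should only ever see a handful of distinct shapes. The sketch below shows the idea; `pad_to_bucket`, the bucket sizes, and the pad token id are all hypothetical, and whether a given model (Kokoro included) produces correct output with padded input and an attention mask needs to be verified per model.

```rust
/// Pads `tokens` up to the smallest bucket length that fits.
/// Returns the padded ids plus a mask (1 = real token, 0 = padding),
/// or `None` when the sequence is longer than every bucket.
pub fn pad_to_bucket(
    tokens: &[i64],
    buckets: &[usize],
    pad_id: i64,
) -> Option<(Vec<i64>, Vec<i64>)> {
    // Smallest bucket that can hold the whole sequence.
    let target = buckets
        .iter()
        .copied()
        .filter(|&len| len >= tokens.len())
        .min()?;

    let mut ids = tokens.to_vec();
    ids.resize(target, pad_id);

    // 1 marks a real token, 0 marks padding.
    let mut mask = vec![1i64; tokens.len()];
    mask.resize(target, 0);

    Some((ids, mask))
}
```

The trade-off is wasted compute on the padded positions: fewer, larger buckets mean fewer distinct shapes but more padding per input.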
Rule 4: First Inference Has a Cold-Start Penalty
CoreML compiles the model to ANE instructions on first run. This adds 1-5 seconds of startup time depending on model complexity.
For benchmarks, we use warm iterations. But in production, this cold start matters. Xybrid’s warmup() API exists specifically for this:
// Pre-compile for ANE during loading screen
executor.warmup(&metadata)?;
Guideline: Always warm up CoreML models before latency-sensitive inference.
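To see how large the cold-start penalty is for a given model, it helps to time the first call separately from the rest. The helper below is a generic sketch using only the standard library; `time_cold_and_warm` is a hypothetical name, and the closure stands in for whatever inference call your code makes. It is not Xybrid's `warmup()` implementation.

```rust
use std::time::{Duration, Instant};

/// Calls `infer` once and times it as the cold run, then calls it
/// `warm_runs` more times and returns the average warm duration.
pub fn time_cold_and_warm<F: FnMut()>(mut infer: F, warm_runs: u32) -> (Duration, Duration) {
    // First call: includes any one-time compilation or setup cost.
    let start = Instant::now();
    infer();
    let cold = start.elapsed();

    // Subsequent calls: steady-state latency.
    let start = Instant::now();
    for _ in 0..warm_runs {
        infer();
    }
    let warm_avg = if warm_runs == 0 {
        Duration::ZERO
    } else {
        start.elapsed() / warm_runs
    };

    (cold, warm_avg)
}
```

If the cold duration is orders of magnitude above the warm average, that gap is the cost a warm-up call during app startup would hide from the user.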
How We Integrated This in Xybrid
Rather than forcing users to choose execution providers, Xybrid selects them based on the model type and platform:
# Platform presets handle this automatically
[features]
platform-macos = ["ort-download", "ort-coreml", "candle-metal", "llm-llamacpp"]
platform-ios = ["ort-download", "ort-coreml", "candle-metal", "llm-llamacpp"]
platform-android = ["ort-dynamic", "llm-llamacpp"]
When CoreML is available, ONNX Runtime automatically evaluates which ops can run on the ANE and falls back to CPU for the rest. This per-op routing means even partially supported models can get some benefit.
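As a mental model for per-op routing, picture the graph being cut into consecutive segments, each assigned to the accelerator or to the CPU. The sketch below is a simplified illustration, not ONNX Runtime's actual partitioning algorithm: it treats the graph as a flat list of op names, and the `partition` function, the `Device` enum, and the supported-op list are all hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    Accelerator,
    Cpu,
}

/// Groups consecutive ops into segments that run on the same device.
/// An op goes to the accelerator only if it appears in `supported`.
pub fn partition<'a>(ops: &[&'a str], supported: &[&str]) -> Vec<(Device, Vec<&'a str>)> {
    let mut segments: Vec<(Device, Vec<&'a str>)> = Vec::new();

    for &op in ops {
        let device = if supported.iter().any(|s| *s == op) {
            Device::Accelerator
        } else {
            Device::Cpu
        };

        // Extend the current segment when the device has not changed.
        if let Some((last_device, group)) = segments.last_mut() {
            if *last_device == device {
                group.push(op);
                continue;
            }
        }

        // Otherwise start a new segment on the other device.
        segments.push((device, vec![op]));
    }

    segments
}
```

Every boundary between segments is a hand-off between devices, so a graph with unsupported ops scattered through it ends up with many small segments and may see little of the accelerator's benefit.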
Summary: When to Use What
| Model Type | Best EP | Expected Speedup | Why |
|---|---|---|---|
| Vision (CNN, fixed shape) | CoreML ANE | 5-10x | Regular ops, static shapes |
| Embeddings (fixed shape) | CoreML ANE | 2-4x | Matrix multiply benefits |
| TTS (dynamic shapes) | CPU | 1x | Recompilation overhead |
| Tiny models (<1MB) | CPU | n/a | Dispatch overhead dominates |
| LLMs (GGUF) | Metal GPU | 2-5x | llama.cpp uses Metal directly |
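The table can also be read as a decision procedure. The following sketch encodes it under simplifying assumptions: `ModelTraits`, `Provider`, and `pick_provider` are hypothetical names rather than Xybrid's API, the 1 MB cutoff is the rough guideline from Rule 1, and real selection would also need to consider measured latency.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Provider {
    Cpu,
    CoreMl,
    Metal,
}

/// Hypothetical description of a model, just enough to apply the rules.
pub struct ModelTraits {
    pub size_bytes: u64,
    pub dynamic_shapes: bool,
    pub is_gguf_llm: bool,
}

/// Applies the rules of thumb in order of precedence.
pub fn pick_provider(model: &ModelTraits) -> Provider {
    // GGUF LLMs go through llama.cpp, which uses Metal directly.
    if model.is_gguf_llm {
        return Provider::Metal;
    }
    // Tiny models lose to dispatch overhead; dynamic shapes negate the gain.
    if model.size_bytes < 1_000_000 || model.dynamic_shapes {
        return Provider::Cpu;
    }
    // Fixed-shape models with enough compute are the CoreML sweet spot.
    Provider::CoreMl
}
```

Applied to the three benchmark models, this picks CPU for MNIST (26 KB), CoreML for MobileNet v2 (14 MB, fixed shape), and CPU for Kokoro (dynamic shapes), matching the measured results.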
Run the Benchmarks Yourself
git clone https://github.com/xybrid-ai/xybrid.git
cd xybrid
# Download benchmark models
cd integration-tests && ./download.sh mnist mobilenet kokoro-82m && cd ..
# CPU baseline
cargo bench -p xybrid-core
# CoreML
cargo bench -p xybrid-core --features coreml-ep
# View HTML report
open target/criterion/report/index.html
The Criterion reports include statistical analysis, confidence intervals, and comparison between runs. No hand-wavy “felt faster”, just real numbers.
Key Takeaways
- Benchmark, don’t assume. “Apple Silicon is fast” is not the same as “CoreML is always faster.”
- Model shape matters more than model size. A 14MB CNN (MobileNet) benefits enormously; a 330MB transformer (Kokoro) doesn’t.
- Warmup is mandatory. CoreML cold-start can dominate perceived latency.
- Per-op routing is free. ONNX Runtime’s CoreML EP automatically splits graphs — you don’t have to manually decide.
The 6.8x on MobileNet is real. But it only comes when you know which models to point at the ANE.
Run your own benchmarks: github.com/xybrid-ai/xybrid
Got benchmark results on different hardware? Share them in the comments — we’re collecting data across device generations.