
How We Made ONNX Runtime 6.8x Faster on Apple Silicon with CoreML

Real benchmarks showing when Apple's Neural Engine helps (and when it hurts). Lessons from optimizing ML inference across execution providers.

Glenn Sonna
6 min read
Tags: onnx-runtime, coreml, apple-silicon-ml, neural-engine, on-device-ai

“Just enable CoreML, it’ll be faster.”

We heard this advice constantly. So we benchmarked it properly. The results surprised us: 6.8x faster for some models, actually slower for others.

Here’s what we learned optimizing ONNX Runtime inference on Apple Silicon, and the rules of thumb that came out of it.


The Setup

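We built Xybrid, an on-device ML inference SDK written in Rust. One of its backends is ONNX Runtime, which supports multiple execution providers (EPs). On Apple Silicon, the compute options available to us are:

  • CPU: the default, works everywhere
  • CoreML: Apple’s ML framework, can target the Neural Engine (ANE)
  • Metal: GPU compute via Apple’s Metal API (in Xybrid’s platform presets this comes from the candle-metal and llama.cpp features rather than an ONNX Runtime EP)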

We wanted to know: for which models does CoreML actually help?

The Benchmark

We used Criterion.rs with three representative models:

| Model | Type | Input Shape | Size |
|---|---|---|---|
| MNIST | Vision (tiny) | [1, 1, 28, 28] | 26 KB |
| MobileNet v2 | Vision (CNN) | [1, 3, 224, 224] | 14 MB |
| Kokoro 82M | TTS (transformer) | dynamic | 330 MB |

# CPU baseline
cargo bench -p xybrid-core

# CoreML (routes eligible ops to ANE)
cargo bench -p xybrid-core --features coreml-ep

The Results

Tested on an M2 MacBook Pro:

| Model | CPU | CoreML (ANE) | Speedup |
|---|---|---|---|
| MNIST | 0.15 ms | 0.8 ms | 0.2x (slower) |
| MobileNet v2 | 12.4 ms | 1.8 ms | 6.8x faster |
| Kokoro 82M | 280 ms | 290 ms | ~1x (no benefit) |

Wait — MNIST got slower with CoreML? And Kokoro saw no improvement?

Why CoreML Isn’t Always Faster

Rule 1: Tiny Models Lose to Dispatch Overhead

The ANE is a dedicated hardware accelerator, and sending work to it has a fixed per-call cost: CoreML has to schedule the compiled graph on the ANE and marshal inputs and outputs between the CPU and the ANE. (Compiling the graph is a separate, one-time cost, covered under Rule 4.)

For MNIST (26 KB, runs in 0.15ms on CPU), that dispatch overhead costs more than the inference itself. The model is too small to benefit.

Guideline: Models under ~1 MB with CPU inference times under 1 ms are unlikely to benefit from CoreML. Stick with CPU.
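
One way to see why a threshold near 1 ms is plausible is a toy cost model. The sketch below assumes CoreML latency is a fixed dispatch cost plus the CPU latency divided by a constant acceleration factor. Both parameters are fitted to the two fixed-shape rows of the results table (MNIST and MobileNet v2), so the fit is exact by construction and says nothing about other hardware or models. The names `DispatchModel`, `fit`, `predict_coreml_ms`, and `break_even_cpu_ms` are illustrative and not part of Xybrid.

```rust
/// Toy model (an assumption, not a measurement):
///   coreml_ms = overhead_ms + cpu_ms / accel
#[derive(Debug, Clone, Copy)]
pub struct DispatchModel {
    /// Fixed per-call dispatch cost in milliseconds.
    pub overhead_ms: f64,
    /// Assumed constant speedup of the accelerated compute itself.
    pub accel: f64,
}

impl DispatchModel {
    /// Fits both parameters to two (cpu_ms, coreml_ms) measurements.
    pub fn fit(small: (f64, f64), large: (f64, f64)) -> Self {
        let accel = (large.0 - small.0) / (large.1 - small.1);
        let overhead_ms = small.1 - small.0 / accel;
        Self { overhead_ms, accel }
    }

    /// Predicted CoreML latency for a model that takes `cpu_ms` on CPU.
    pub fn predict_coreml_ms(&self, cpu_ms: f64) -> f64 {
        self.overhead_ms + cpu_ms / self.accel
    }

    /// CPU latency at which CoreML and CPU tie (requires accel > 1).
    /// Models faster than this on CPU are predicted to lose on CoreML.
    pub fn break_even_cpu_ms(&self) -> f64 {
        self.overhead_ms * self.accel / (self.accel - 1.0)
    }
}
```

Calling `DispatchModel::fit((0.15, 0.8), (12.4, 1.8))` gives an overhead of about 0.79 ms and a break-even CPU latency of about 0.86 ms, which is in line with the guideline above. Kokoro does not fit this model, because dynamic shapes add costs that it ignores.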

Rule 2: CNNs with Fixed Shapes Are the Sweet Spot

MobileNet is the perfect ANE workload:

  • Fixed input shape [1, 3, 224, 224] — the CoreML compiler can optimize the graph once
  • Regular operations — convolutions, batch norms, ReLUs map directly to ANE instructions
  • Sufficient compute — enough work to amortize dispatch overhead

The 6.8x speedup is real and consistent. Vision models (image classification, object detection, segmentation) with static shapes will almost always benefit.

Guideline: Fixed-shape CNN models → enable CoreML. Expect 5-10x speedup.

Rule 3: Dynamic Shapes Kill ANE Performance

Kokoro TTS is a transformer with dynamic input shapes — the token sequence length varies per input. Each new shape triggers a recompilation of the CoreML graph.

Even when the graph is cached, dynamic shapes prevent the ANE from using its most aggressive optimizations. The result: inference time roughly matches CPU.

Guideline: Models with dynamic input dimensions (most transformers, seq2seq, attention-based models) — CoreML won’t help unless you pad to fixed sizes.
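
Padding to fixed sizes usually means picking a small set of sequence-length buckets, so the runtime only ever sees a handful of shapes. The sketch below is a minimal illustration, not Xybrid code: the bucket sizes, the `pad_to_bucket` name, and the assumption that the model accepts an attention mask alongside the token ids are all hypothetical.

```rust
/// Hypothetical bucket sizes; choose them from your own
/// sequence-length distribution.
pub const BUCKETS: [usize; 4] = [32, 64, 128, 256];

/// Pads `tokens` with `pad_id` up to the smallest bucket that fits.
/// Returns the padded ids plus an attention mask
/// (1 = real token, 0 = padding), or `None` when the input is
/// longer than the largest bucket.
pub fn pad_to_bucket(
    tokens: &[i64],
    pad_id: i64,
    buckets: &[usize],
) -> Option<(Vec<i64>, Vec<i64>)> {
    // Smallest bucket that can hold the whole sequence.
    let target = buckets
        .iter()
        .copied()
        .filter(|&b| b >= tokens.len())
        .min()?;

    let mut ids = tokens.to_vec();
    ids.resize(target, pad_id);

    let mut mask = vec![1i64; tokens.len()];
    mask.resize(target, 0);

    Some((ids, mask))
}
```

With these buckets, a 3-token input is padded to length 32 and a 33-token input to length 64. The trade-off is wasted compute on padding, so whether this beats plain CPU inference is something to benchmark per model.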

Rule 4: First Inference Has a Cold-Start Penalty

CoreML compiles the model to ANE instructions on first run. This adds 1-5 seconds of startup time depending on model complexity.

For benchmarks, we use warm iterations. But in production, this cold start matters. Xybrid’s warmup() API exists specifically for this:

// Pre-compile for ANE during loading screen
executor.warmup(&metadata)?;

Guideline: Always warm up CoreML models before latency-sensitive inference.
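
To check how large the cold-start penalty is on your own hardware, time the first call separately from the rest. The helper below is a generic sketch using only the standard library. The `cold_vs_warm` name is made up for illustration, and the closure stands in for whatever performs one inference call in your code.

```rust
use std::time::{Duration, Instant};

/// Times the first call to `run` separately from the next
/// `warm_iters` calls. Returns (cold latency, mean warm latency).
pub fn cold_vs_warm<F: FnMut()>(mut run: F, warm_iters: u32) -> (Duration, Duration) {
    assert!(warm_iters > 0, "need at least one warm iteration");

    // The first call pays any one-time compilation cost.
    let start = Instant::now();
    run();
    let cold = start.elapsed();

    // Later calls reuse whatever was compiled or cached.
    let start = Instant::now();
    for _ in 0..warm_iters {
        run();
    }
    let warm_mean = start.elapsed() / warm_iters;

    (cold, warm_mean)
}
```

If the cold number is orders of magnitude above the warm mean, that gap is what a warmup step hides from the user.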

How We Integrated This in Xybrid

Rather than forcing users to choose execution providers, Xybrid selects them based on the model type and platform:

# Platform presets handle this automatically
[features]
platform-macos = ["ort-download", "ort-coreml", "candle-metal", "llm-llamacpp"]
platform-ios   = ["ort-download", "ort-coreml", "candle-metal", "llm-llamacpp"]
platform-android = ["ort-dynamic", "llm-llamacpp"]

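When the CoreML EP is enabled, ONNX Runtime checks which ops CoreML supports, hands those parts of the graph to CoreML, and falls back to CPU for the rest. CoreML then decides whether its parts run on the ANE, the GPU, or the CPU. This per-op routing means partially supported models can still get some benefit, although every hop between CoreML and CPU adds overhead.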
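
To make the idea of per-op routing concrete, here is a simplified sketch that splits a linear list of ops into alternating CoreML and CPU segments. It is not ONNX Runtime's actual partitioning logic, which works on a graph rather than a list. The `Segment` and `partition` names are hypothetical, and the supported-op list is supplied by the caller purely for illustration.

```rust
/// A contiguous run of ops assigned to one backend.
#[derive(Debug, PartialEq, Eq)]
pub struct Segment<'a> {
    pub on_coreml: bool,
    pub ops: Vec<&'a str>,
}

/// Splits a linear op sequence into alternating CoreML / CPU segments.
/// `supported` is the caller's list of op names CoreML can handle.
pub fn partition<'a>(ops: &[&'a str], supported: &[&str]) -> Vec<Segment<'a>> {
    let mut segments: Vec<Segment<'a>> = Vec::new();
    for &op in ops {
        let on_coreml = supported.contains(&op);
        // Extend the current segment if it targets the same backend.
        let extends_last = segments
            .last()
            .map_or(false, |s| s.on_coreml == on_coreml);
        if extends_last {
            if let Some(last) = segments.last_mut() {
                last.ops.push(op);
            }
        } else {
            segments.push(Segment { on_coreml, ops: vec![op] });
        }
    }
    segments
}
```

Each boundary between segments is a hand-off between backends, so a model with one unsupported op in the middle ends up with three segments and two hand-offs.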

Summary: When to Use What

| Model Type | Best EP | Expected Speedup | Why |
|---|---|---|---|
| Vision (CNN, fixed shape) | CoreML ANE | 5-10x | Regular ops, static shapes |
| Embeddings (fixed shape) | CoreML ANE | 2-4x | Matrix multiply benefits |
| TTS (dynamic shapes) | CPU | 1x | Recompilation overhead |
| Tiny models (<1MB) | CPU | n/a | Dispatch overhead dominates |
| LLMs (GGUF) | Metal GPU | 2-5x | llama.cpp uses Metal directly |
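
The table can be encoded as a small decision function. This is a sketch of the rules of thumb in this post, not how Xybrid selects providers internally. The `Provider`, `ModelProfile`, and `suggest_provider` names and fields are hypothetical, and the 1 MB threshold is the rough guideline from Rule 1.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    Cpu,
    CoreMl,
    MetalGpu,
}

/// Hypothetical model description; field names are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct ModelProfile {
    pub size_bytes: u64,
    pub fixed_input_shape: bool,
    pub is_gguf_llm: bool,
}

/// Suggests a starting point; always confirm with a benchmark.
pub fn suggest_provider(m: &ModelProfile) -> Provider {
    // Rough "~1MB" threshold from Rule 1.
    const TINY_MODEL_BYTES: u64 = 1_000_000;

    if m.is_gguf_llm {
        Provider::MetalGpu // llama.cpp uses Metal directly
    } else if m.size_bytes < TINY_MODEL_BYTES {
        Provider::Cpu // dispatch overhead dominates
    } else if m.fixed_input_shape {
        Provider::CoreMl // fixed-shape CNNs and embeddings
    } else {
        Provider::Cpu // dynamic shapes see no CoreML benefit
    }
}
```

Fed the three benchmark models, this returns CPU for MNIST (26 KB), CoreML for MobileNet v2 (14 MB, fixed shape), and CPU for Kokoro 82M (330 MB, dynamic shape).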

Run the Benchmarks Yourself

git clone https://github.com/xybrid-ai/xybrid.git
cd xybrid

# Download benchmark models
cd integration-tests && ./download.sh mnist mobilenet kokoro-82m && cd ..

# CPU baseline
cargo bench -p xybrid-core

# CoreML
cargo bench -p xybrid-core --features coreml-ep

# View HTML report
open target/criterion/report/index.html

The Criterion reports include statistical analysis, confidence intervals, and comparison between runs. No hand-wavy “felt faster” — real numbers.


Key Takeaways

  1. Benchmark, don’t assume. “Apple Silicon is fast” is not the same as “CoreML is always faster.”
  2. Model shape matters more than model size. A 14MB CNN (MobileNet) benefits enormously; a 330MB transformer (Kokoro) doesn’t.
  3. Warmup is mandatory. CoreML cold-start can dominate perceived latency.
  4. Per-op routing is automatic. ONNX Runtime’s CoreML EP splits the graph for you, so you don’t have to decide op by op. It is free in effort, not in latency: tiny models still lose to dispatch overhead.

The 6.8x on MobileNet is real. But it only comes when you know which models to point at the ANE.


Run your own benchmarks: github.com/xybrid-ai/xybrid


Got benchmark results on different hardware? Share them in the comments — we’re collecting data across device generations.

