
On-Device AI for Mobile Apps: A Developer's Guide to iOS and Android

How to add on-device AI to iOS and Android apps — comparing CoreML, ONNX Runtime, TensorFlow Lite, and llama.cpp. Practical guide with model sizes, performance benchmarks, and integration patterns.

Glenn Sonna · June 23, 2026 · 12 min read


Adding AI features to a mobile app used to mean integrating a cloud API. Send user data to a server, wait for a response, pay per request. That model works, but it is no longer the only option — and for many features, it is no longer the best one.

Modern smartphones have hardware specifically designed for ML inference: Apple’s Neural Engine, Qualcomm’s Hexagon DSP, Google’s Tensor TPU. A flagship phone from 2024 can transcribe a second of speech in under 100 ms, generate natural-sounding speech, classify images in real time, and even run a 1B-parameter language model, all without an internet connection.

This guide covers the practical side of adding on-device AI to iOS and Android apps: which frameworks to use, which models fit on mobile hardware, what performance to expect, and how to handle the engineering challenges of bundling and running models on constrained devices.

The Mobile AI Hardware Landscape

Before choosing a framework, understand what hardware is available to you.

iOS (Apple Silicon)

Every iPhone since the iPhone 8 (2017) includes a Neural Engine — dedicated hardware for ML inference. The capability has scaled dramatically:

Chip	Neural Engine	TOPS	Device RAM
A11 (iPhone 8/X)	2-core	0.6	2-3 GB
A14 (iPhone 12)	16-core	11	4 GB
A15 (iPhone 13/14)	16-core	15.8	4-6 GB
A16 (iPhone 14 Pro/15)	16-core	17	6 GB
A17 Pro (iPhone 15 Pro)	16-core	35	8 GB
A18 Pro (iPhone 16 Pro)	16-core	35+	8 GB

The Neural Engine is accessed through CoreML, Apple’s ML framework. Models that CoreML can optimize for the Neural Engine run dramatically faster than CPU-only inference — we have measured 6.8x speedups on vision models like MobileNet.

The GPU (Metal) is also available for ML workloads that do not map well to the Neural Engine, such as models with dynamic shapes or certain attention patterns.

Android

Android hardware is more varied. The key accelerators:

Qualcomm (Snapdragon): The Hexagon DSP and AI Engine provide ML acceleration. Accessed via ONNX Runtime’s QNN provider or Qualcomm’s AI Hub SDK. Performance varies significantly by chipset generation.

Google Tensor: Custom chips in Pixel phones with dedicated TPU cores. Accessed via TensorFlow Lite with the Google Edge TPU delegate.

MediaTek: APU (AI Processing Unit) available in Dimensity chips. Accessed via the MediaTek NeuroPilot SDK or ONNX Runtime.

Samsung (Exynos): NPU available in some Galaxy devices. Accessed via Samsung’s Neural SDK.

The fragmentation is real. An AI feature that runs great on a Pixel 8 (Tensor G3) might be slow on a mid-range Snapdragon. Practical approach: target CPU inference as the baseline, use hardware acceleration as an optimization for specific chipsets.

Chipset	Category	AI Performance	Device RAM
Snapdragon 8 Gen 3	Flagship	45 TOPS	8-16 GB
Snapdragon 8 Gen 2	Flagship	33 TOPS	8-12 GB
Snapdragon 7 Gen 1	Mid-range	13 TOPS	6-8 GB
Tensor G3 (Pixel 8)	Flagship	~20 TOPS	12 GB
Dimensity 9300	Flagship	41 TOPS	8-16 GB
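
The "CPU baseline, acceleration as an optimization" approach can be sketched as a small selection helper. This is an illustration, not a definitive implementation: the provider strings follow ONNX Runtime's naming, while `pick_providers` and the `PREFERRED` ordering are hypothetical and should be tuned after benchmarking on real devices.

```python
# Hypothetical helper: order execution providers so that CPU is always the
# guaranteed baseline and accelerators are tried first only when present.
CPU = "CPUExecutionProvider"

# Assumed preference order per platform; adjust after benchmarking real devices.
PREFERRED = {
    "android": ["QNNExecutionProvider", "NnapiExecutionProvider"],
    "ios": ["CoreMLExecutionProvider"],
}


def pick_providers(platform: str, available: list[str]) -> list[str]:
    """Return the accelerators that are actually available, followed by CPU."""
    chosen = [p for p in PREFERRED.get(platform, []) if p in available]
    return chosen + [CPU]


if __name__ == "__main__":
    # A runtime build that only ships the CPU provider.
    print(pick_providers("android", [CPU]))
    # A Snapdragon-oriented build of the runtime that includes QNN.
    print(pick_providers("android", ["QNNExecutionProvider", CPU]))
```

Because CPU is always appended last, a device without a supported accelerator still runs the feature, only more slowly.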

Inference Frameworks Compared

Four frameworks dominate mobile ML inference. Each has different strengths.

CoreML (iOS only)

Apple’s native ML framework. First-class access to the Neural Engine, GPU (Metal), and CPU. Tightly integrated with Xcode and the Apple ecosystem.

Strengths:

Best Neural Engine utilization on iOS
Optimized for Apple Silicon
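Supports model conversion from TensorFlow and PyTorch via coremltools (ONNX models generally have to be converted from their source framework, since recent coremltools releases no longer ship an ONNX converter)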
Xcode integration with model preview and profiling

Tradeoffs:

iOS and macOS only
Limited model format support (must convert to .mlmodel/.mlpackage)
Less control over execution providers than ONNX Runtime
Dynamic shapes can force CPU fallback (losing Neural Engine benefits)

Best for: iOS-exclusive apps where maximum performance on Apple hardware matters.

ONNX Runtime

Microsoft’s cross-platform inference engine. Runs on iOS (CoreML backend), Android (NNAPI, QNN), and desktop. Supports the standardized ONNX model format.

Strengths:

Cross-platform: same model runs on iOS, Android, Windows, macOS, Linux
Multiple execution providers (CoreML, CUDA, DirectML, NNAPI, QNN)
Broad model format support
Active development and optimization

Tradeoffs:

Binary size is significant (~15-30 MB depending on providers)
CoreML provider on iOS does not always match native CoreML performance
NNAPI on Android is deprecated in favor of vendor-specific APIs

Best for: Cross-platform apps that need to run the same model on iOS and Android. Vision, audio, and embedding models.

TensorFlow Lite

Google’s mobile inference framework. Mature ecosystem with good model conversion tools and Android-first optimizations.

Strengths:

Mature model conversion from TensorFlow/Keras
Good GPU delegate for Android
Google Edge TPU support on Pixel devices
Extensive model zoo and documentation

Tradeoffs:

TensorFlow ecosystem lock-in
iOS support exists but is not the primary focus
Less active development than ONNX Runtime
Cannot run ONNX models directly

Best for: Android-first apps already in the TensorFlow ecosystem.

llama.cpp

C/C++ library for LLM inference. The de facto standard for running language models on consumer hardware, including mobile.

Strengths:

Best local LLM inference performance
Metal acceleration on iOS, Vulkan on Android
Minimal dependencies, small binary
GGUF format is the standard for quantized LLMs

Tradeoffs:

LLMs only (not for vision, audio, or embeddings)
Requires C/C++ FFI from Swift, Kotlin, Dart, etc.
Memory management on mobile requires careful handling

Best for: Any app that needs to run an LLM on-device.

Platform Summary

Feature	CoreML	ONNX Runtime	TF Lite	llama.cpp
iOS	Native	Via CoreML EP	Supported	Metal
Android	No	NNAPI/QNN	Native	Vulkan
Model types	Any	Any	Any	LLMs only
LLM support	Limited	Limited	Limited	Excellent
Neural Engine	Yes	Via CoreML	No	No
Binary size	0 (system)	15-30 MB	5-10 MB	2-5 MB

Practical recommendation: Use ONNX Runtime or CoreML for non-LLM models (ASR, TTS, embeddings, vision). Use llama.cpp for LLMs. This gives you the best performance for each model type.

What Runs on a Phone Today

Here are models that run well on current mobile hardware, tested on real devices:

Speech Recognition (ASR)

Model	Size	iPhone 15 Pro	Pixel 8 Pro	Quality
Whisper Tiny	75 MB	~60ms/sec audio	~90ms/sec audio	Good for English
Whisper Base	140 MB	~120ms/sec audio	~180ms/sec audio	Better multilingual
Whisper Small	460 MB	~400ms/sec audio	~600ms/sec audio	Best mobile quality

Whisper Tiny processes audio faster than real-time on any modern phone. It handles English accurately and is the best choice for real-time transcription on mobile.
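
To make "faster than real-time" concrete, the table's "ms per second of audio" figures can be turned into a real-time factor (RTF), where anything below 1.0 keeps up with live audio. The sketch below uses only the iPhone 15 Pro numbers from the table; the function names are illustrative.

```python
def real_time_factor(ms_per_audio_second: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return ms_per_audio_second / 1000.0


def transcription_seconds(clip_seconds: float, ms_per_audio_second: float) -> float:
    """Estimated wall-clock time to transcribe a clip of the given length."""
    return clip_seconds * real_time_factor(ms_per_audio_second)


if __name__ == "__main__":
    # Figures from the table above (iPhone 15 Pro column).
    for name, ms in [("Whisper Tiny", 60), ("Whisper Base", 120), ("Whisper Small", 400)]:
        rtf = real_time_factor(ms)
        secs = transcription_seconds(30, ms)
        print(f"{name}: RTF {rtf:.2f}, 30 s clip in about {secs:.1f} s")
```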

Text-to-Speech (TTS)

Model	Size	iPhone 15 Pro	Pixel 8 Pro	Quality
Kokoro 82M	180 MB	~0.8s per sentence	~1.2s per sentence	Natural, expressive
Piper (onnx)	60-100 MB	~0.3s per sentence	~0.5s per sentence	Fast, slightly robotic

Kokoro produces remarkably natural speech for its size. The synthesis delay (roughly 0.8 to 1.2 s per sentence on the devices above) is acceptable for most use cases: users tolerate a pause of about a second before speech begins.

Language Models (LLM)

Model	Size (Q4)	iPhone 15 Pro	Pixel 8 Pro	Quality
Qwen 3.5 0.8B	~500 MB	~20 tok/s	~12 tok/s	Good for simple tasks
Llama 3.2 1B	~700 MB	~18 tok/s	~10 tok/s	Better reasoning
Llama 3.2 3B	~1.8 GB	~8 tok/s	~5 tok/s	Desktop-like quality

The 1B models are the sweet spot for mobile: fast enough for conversational use, small enough to download over cellular, and capable enough for summarization, Q&A, and simple chat.
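
Whether a given tokens-per-second figure is "fast enough" depends on reply length. A rough estimate, assuming generation speed stays constant and ignoring prompt processing time (both simplifications):

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a reply of the given length at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # Decode rates from the table above; the 150-token reply is an assumption.
    for name, rate in [("Llama 3.2 1B, iPhone 15 Pro", 18), ("Llama 3.2 3B, iPhone 15 Pro", 8)]:
        print(f"{name}: {generation_seconds(150, rate):.1f} s for 150 tokens")
```

A 150-token reply takes a little over 8 seconds at 18 tok/s and almost 19 seconds at 8 tok/s, which is why streaming tokens to the UI as they arrive matters more as models get larger.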

Embeddings

Model	Size	iPhone 15 Pro	Pixel 8 Pro	Use Case
MiniLM-L6	80 MB	~5ms per query	~8ms per query	Semantic search
all-MiniLM-L6	80 MB	~5ms per query	~8ms per query	Similarity matching

Embedding models are small, fast, and enable powerful features: on-device semantic search, document clustering, and RAG without sending documents to the cloud.
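
On-device semantic search reduces to comparing a query embedding against stored document embeddings. The sketch below shows the ranking step with cosine similarity in plain Python; the tiny 3-dimensional vectors are made up for illustration, and a real app would obtain vectors from the embedding model (MiniLM-L6 produces 384-dimensional vectors).

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=3):
    """Rank (doc_id, vector) pairs by similarity to the query, best first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


if __name__ == "__main__":
    # Toy embeddings standing in for real model output.
    index = [
        ("note-1", [0.9, 0.1, 0.0]),
        ("note-2", [0.0, 1.0, 0.0]),
        ("note-3", [0.7, 0.7, 0.0]),
    ]
    print(search([1.0, 0.0, 0.0], index, top_k=2))
```

A linear scan like this is usually fine for a few thousand documents; larger collections may justify an approximate nearest-neighbor index.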

Integration Patterns

Pattern 1: Download on First Launch

The most common approach. The app ships without models. On first launch, it downloads the required model(s) from a CDN.

Pros: Small initial app size. Models can be updated independently of the app.
Cons: First-launch experience requires download and wait time. Needs network on first use.

Implementation notes:

Show a progress indicator with estimated time
Download in the background if possible
Cache the model in the app’s documents directory (persists across updates)
Verify integrity with a SHA-256 checksum after download
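
The integrity check from the list above can be sketched as follows. It streams the file in chunks so a multi-hundred-megabyte model is never held in memory at once. The function names are illustrative, and the expected hash is assumed to be published alongside the model on the CDN.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected_hex: str) -> bool:
    """True if the downloaded file matches the published SHA-256 checksum."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


if __name__ == "__main__":
    # Stand-in for a downloaded model file.
    model = Path("demo-model.bin")
    model.write_bytes(b"not a real model")
    expected = hashlib.sha256(b"not a real model").hexdigest()
    print("checksum ok:", verify_model(model, expected))
    model.unlink()
```

If verification fails, deleting the file and retrying the download is safer than attempting to load a truncated model.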

Pattern 2: Bundle in App

Ship the model file inside the app binary. The app works immediately with no download step.

Pros: Works offline from first launch. No download management code.
Cons: Increases app binary size significantly. Store size limits apply: on iOS the cap is 4 GB, but apps over 200 MB trigger a warning; on Google Play it is 150 MB for the AAB plus 2 GB of expansion files.

Best for: Apps where the model is small (under 50 MB) or where offline-first is a hard requirement.
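
The bundle-versus-download choice can be written down as a simple rule. This is only a sketch of the guidance in this section: the 50 MB threshold comes from the text, while `choose_packaging` and its return values are hypothetical.

```python
BUNDLE_LIMIT_MB = 50  # threshold suggested in the text, not a platform limit


def choose_packaging(model_mb: float, offline_first: bool) -> str:
    """Return "bundle" for small or offline-critical models, else "download"."""
    if offline_first or model_mb < BUNDLE_LIMIT_MB:
        return "bundle"
    return "download"


if __name__ == "__main__":
    # Sizes taken from the model tables earlier in this guide.
    print("Whisper Tiny (75 MB):", choose_packaging(75, offline_first=False))
    print("Whisper Tiny, offline-first app:", choose_packaging(75, offline_first=True))
```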

Pattern 3: Lazy Loading With Fallback

Load models on demand when the user first accesses the feature. If the model is not available, show a download prompt or fall back to a cloud API.

Pros: Minimizes initial footprint. Only downloads what the user actually uses.
Cons: Feature is not available until model is downloaded. Requires cloud fallback or graceful degradation.
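
A minimal sketch of this pattern, assuming the caller supplies a `load` function for the local runtime and an optional `cloud_fallback` callable. Both are placeholders here, and `LazyModel` is a hypothetical wrapper, not part of any framework.

```python
from pathlib import Path
from typing import Callable, Optional


class LazyModel:
    """Loads a local model on first use; falls back to a cloud call if absent."""

    def __init__(
        self,
        model_path: Path,
        load: Callable[[Path], Callable[[str], str]],
        cloud_fallback: Optional[Callable[[str], str]] = None,
    ):
        self._path = model_path
        self._load = load
        self._fallback = cloud_fallback
        self._model: Optional[Callable[[str], str]] = None

    def run(self, text: str) -> str:
        # Load once, and only when the feature is actually used.
        if self._model is None and self._path.exists():
            self._model = self._load(self._path)
        if self._model is not None:
            return self._model(text)
        if self._fallback is not None:
            return self._fallback(text)
        raise RuntimeError("model not downloaded and no fallback configured")

    def unload(self) -> None:
        """Drop the in-memory model, e.g. in response to a memory warning."""
        self._model = None


if __name__ == "__main__":
    # Placeholder loader and fallback standing in for a real runtime and API.
    summarizer = LazyModel(
        Path("missing-model.gguf"),
        load=lambda path: (lambda text: f"local:{text}"),
        cloud_fallback=lambda text: f"cloud:{text}",
    )
    print(summarizer.run("hello"))
```

Raising when neither path is available gives the UI a clear point to show the download prompt.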

Memory Management on Mobile

Mobile devices have strict memory limits. iOS will kill your app if it uses too much memory. Android has similar constraints.

Rules of thumb:

Keep total model memory under 50% of available RAM
Unload models when not actively in use
Use lazy loading — do not load all models at app startup
Monitor memory warnings (iOS didReceiveMemoryWarning, Android onTrimMemory)
Prefer quantized models (Q4) to reduce memory footprint

For LLMs, the context window length directly impacts memory usage. A 1B model with a 2K context uses significantly less memory than the same model with an 8K context. Cap the context length to what the use case actually requires.
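
The context-length effect comes mostly from the key/value cache, which grows linearly with the number of tokens. The estimate below assumes an FP16 cache and a Llama 3.2 1B-like shape (16 layers, 8 KV heads, head dimension 64). Those figures are assumptions for illustration, so check the model's own configuration before relying on them.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys and values (factor 2) for every layer, KV head, and token position."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value


def fits_budget(model_bytes, cache_bytes, ram_bytes, fraction=0.5):
    """Apply the rule of thumb: weights plus cache under a fraction of RAM."""
    return model_bytes + cache_bytes <= fraction * ram_bytes


if __name__ == "__main__":
    for ctx in (2048, 8192):
        cache = kv_cache_bytes(layers=16, kv_heads=8, head_dim=64, context_len=ctx)
        # ~700 MB of Q4 weights (from the table above) on a 4 GB device.
        ok = fits_budget(700 * MIB, cache, 4 * GIB)
        print(f"context {ctx}: KV cache {cache / MIB:.0f} MiB, within budget: {ok}")
```

Under these assumptions the cache is 64 MiB at a 2K context and 256 MiB at 8K, on top of the model weights. The runtime's scratch buffers add more, so treat the result as a lower bound.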

Performance Optimization

Use Hardware Acceleration

iOS: CoreML with Neural Engine for vision/audio models. Metal for LLMs via llama.cpp.
Android: Vendor-specific delegates (for example QNN on Snapdragon) for vision/audio, with NNAPI as a legacy path on older devices. Vulkan for LLMs via llama.cpp.

Hardware acceleration can provide 2-10x speedups over CPU-only inference, depending on the model architecture.

Profile Before Optimizing

Use platform profiling tools to identify bottlenecks:

iOS: Xcode Instruments (Neural Engine, GPU, CPU profiling)
Android: Android Studio Profiler, Perfetto for system-level tracing

Common bottleneck: preprocessing (decoding audio, resizing images) can take longer than inference itself. Optimize the full pipeline, not just the model execution.
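
Before reaching for platform profilers, a coarse per-stage timer often reveals which part of the pipeline dominates. The sketch below is a generic illustration: `PipelineTimer` is a hypothetical helper, and the `time.sleep` calls stand in for real preprocessing and inference work.

```python
import time
from contextlib import contextmanager


class PipelineTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def slowest(self):
        """Name of the stage with the largest accumulated time."""
        return max(self.seconds, key=self.seconds.get)


if __name__ == "__main__":
    timer = PipelineTimer()
    with timer.stage("preprocess"):
        time.sleep(0.03)  # placeholder for audio decoding or image resizing
    with timer.stage("inference"):
        time.sleep(0.01)  # placeholder for the model call
    for name, secs in timer.seconds.items():
        print(f"{name}: {secs * 1000:.1f} ms")
    print("slowest stage:", timer.slowest())
```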

Quantize Aggressively

Q4_K_M quantization reduces model size by roughly 3x to 4x compared to FP16 with minimal quality loss. For mobile, always use quantized models: the memory and speed improvements are substantial, and the quality difference is negligible for most tasks.
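
The size reduction follows directly from bits per weight. A nominal 4-bit format is exactly 4x smaller than FP16, but Q4_K_M also stores block scales and keeps some tensors at higher precision, so its effective rate is somewhat above 4 bits. The 4.8 bits used below is an approximate assumption, as is the 1.24 billion parameter count for Llama 3.2 1B.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def compression_ratio(bits_per_weight: float, baseline_bits: float = 16) -> float:
    """How many times smaller the model is than the FP16 baseline."""
    return baseline_bits / bits_per_weight


if __name__ == "__main__":
    params = 1.24  # billions of parameters (assumed)
    for label, bits in [("FP16", 16), ("nominal 4-bit", 4.0), ("Q4_K_M, approx.", 4.8)]:
        size = model_size_gb(params, bits)
        ratio = compression_ratio(bits)
        print(f"{label}: {size:.2f} GB, {ratio:.1f}x smaller than FP16")
```

With these inputs the Q4_K_M estimate comes to roughly 0.74 GB, in line with the ~700 MB listed for Llama 3.2 1B earlier.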

Getting Started

Pick your model. Start with one model for one feature. Speech recognition (Whisper Tiny), TTS (Kokoro), embeddings (MiniLM), or chat (Llama 3.2 1B) are all proven on mobile.
Pick your runtime. ONNX Runtime for non-LLM models, llama.cpp for LLMs. Both work on iOS and Android.
Prototype on-device. Get the model running on a physical device (not a simulator). Measure actual latency and memory usage.
Handle the packaging. Decide between download-on-first-launch and bundled. For models over 50 MB, download-on-first-launch is usually the better UX.
Test on lower-end devices. Your model runs fast on an iPhone 15 Pro. Does it run acceptably on an iPhone 12? On a Pixel 6a? Define your minimum supported hardware and benchmark there.

The hardware is ready. The models are small enough. The frameworks are mature enough. On-device AI on mobile is not a future possibility — it is a present capability waiting for developers to use it.

Related

On-Device AI: The Complete Guide to Running ML Models Locally (the broader guide covering desktop and mobile inference).
How to Run LLMs Locally (deep dive into local LLM runtimes, quantization, and model selection).
Add Text-to-Speech to Your Flutter App in 15 Minutes (hands-on tutorial for on-device TTS in Flutter).
Building a Voice Agent That Runs Entirely On-Device (full voice pipeline tutorial: ASR + LLM + TTS on mobile).

