Adding AI features to a mobile app used to mean integrating a cloud API. Send user data to a server, wait for a response, pay per request. That model works, but it is no longer the only option — and for many features, it is no longer the best one.
Modern smartphones have hardware specifically designed for ML inference: Apple’s Neural Engine, Qualcomm’s Hexagon DSP, Google’s Tensor TPU. A flagship phone from 2024 can transcribe a second of speech in under 100 ms, generate natural-sounding speech, classify images in real time, and even run a 1B-parameter language model, all without an internet connection.
This guide covers the practical side of adding on-device AI to iOS and Android apps: which frameworks to use, which models fit on mobile hardware, what performance to expect, and how to handle the engineering challenges of bundling and running models on constrained devices.
The Mobile AI Hardware Landscape
Before choosing a framework, understand what hardware is available to you.
iOS (Apple Silicon)
Every iPhone since the iPhone 8 (2017) includes a Neural Engine — dedicated hardware for ML inference. The capability has scaled dramatically:
| Chip | Neural Engine | TOPS | Available Memory |
|---|---|---|---|
| A11 (iPhone 8/X) | 2-core | 0.6 | 2-3 GB |
| A14 (iPhone 12) | 16-core | 11 | 4 GB |
| A15 (iPhone 13/14) | 16-core | 15.8 | 4-6 GB |
| A16 (iPhone 14 Pro/15) | 16-core | 17 | 6 GB |
| A17 Pro (iPhone 15 Pro) | 16-core | 35 | 8 GB |
| A18 Pro (iPhone 16 Pro) | 16-core | 35+ | 8 GB |
The Neural Engine is accessed through CoreML, Apple’s ML framework. Models that CoreML can optimize for the Neural Engine run dramatically faster than CPU-only inference — we have measured 6.8x speedups on vision models like MobileNet.
The GPU (Metal) is also available for ML workloads that do not map well to the Neural Engine, such as models with dynamic shapes or certain attention patterns.
Android
Android hardware is more varied. The key accelerators:
Qualcomm (Snapdragon): The Hexagon DSP and AI Engine provide ML acceleration. Accessed via ONNX Runtime’s QNN provider or Qualcomm’s AI Hub SDK. Performance varies significantly by chipset generation.
Google Tensor: Custom chips in Pixel phones with dedicated TPU cores. Accessed via TensorFlow Lite with the Google Edge TPU delegate.
MediaTek: APU (AI Processing Unit) available in Dimensity chips. Accessed via the MediaTek NeuroPilot SDK or ONNX Runtime.
Samsung (Exynos): NPU available in some Galaxy devices. Accessed via Samsung’s Neural SDK.
The fragmentation is real. An AI feature that runs great on a Pixel 8 (Tensor G3) might be slow on a mid-range Snapdragon. Practical approach: target CPU inference as the baseline, use hardware acceleration as an optimization for specific chipsets.
| Chipset | Category | AI Performance | Available Memory |
|---|---|---|---|
| Snapdragon 8 Gen 3 | Flagship | 45 TOPS | 8-16 GB |
| Snapdragon 8 Gen 2 | Flagship | 33 TOPS | 8-12 GB |
| Snapdragon 7 Gen 1 | Mid-range | 13 TOPS | 6-8 GB |
| Tensor G3 (Pixel 8) | Flagship | ~20 TOPS | 12 GB |
| Dimensity 9300 | Flagship | 41 TOPS | 8-16 GB |
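The CPU-baseline approach described above can be sketched as a small selection function. This is a minimal illustration, not a real SDK call: the vendor keys, the backend labels, and the `accelerator_verified` flag (meaning "this chipset has been benchmarked and the accelerated path is known to work") are all hypothetical names introduced for the example.

```python
# Hypothetical mapping from chipset vendor to a preferred accelerator.
# The backend strings are illustrative labels, not real API identifiers.
PREFERRED_BACKEND = {
    "qualcomm": "qnn",
    "google": "edge_tpu",
    "mediatek": "neuropilot",
    "samsung": "samsung_neural",
}


def pick_backend(vendor: str, accelerator_verified: bool) -> str:
    """Use the accelerator only for known, verified chipsets; else CPU."""
    backend = PREFERRED_BACKEND.get(vendor.lower())
    if backend is None or not accelerator_verified:
        return "cpu"  # CPU inference is the baseline that always works
    return backend


print(pick_backend("Qualcomm", accelerator_verified=True))
print(pick_backend("Qualcomm", accelerator_verified=False))
print(pick_backend("UnknownVendor", accelerator_verified=True))
```

The point of the design is that acceleration is opt-in per chipset: an unknown or untested device never leaves the CPU path.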
Inference Frameworks Compared
Four frameworks dominate mobile ML inference. Each has different strengths.
CoreML (iOS only)
Apple’s native ML framework. First-class access to the Neural Engine, GPU (Metal), and CPU. Tightly integrated with Xcode and the Apple ecosystem.
Strengths:
- Best Neural Engine utilization on iOS
- Optimized for Apple Silicon
- Supports model conversion from ONNX, TensorFlow, and PyTorch
- Xcode integration with model preview and profiling
Tradeoffs:
- iOS and macOS only
- Limited model format support (must convert to .mlmodel/.mlpackage)
- Less control over execution providers than ONNX Runtime
- Dynamic shapes can force CPU fallback (losing Neural Engine benefits)
Best for: iOS-exclusive apps where maximum performance on Apple hardware matters.
ONNX Runtime
Microsoft’s cross-platform inference engine. Runs on iOS (CoreML backend), Android (NNAPI, QNN), and desktop. Supports the standardized ONNX model format.
Strengths:
- Cross-platform: same model runs on iOS, Android, Windows, macOS, Linux
- Multiple execution providers (CoreML, CUDA, DirectML, NNAPI, QNN)
- Broad model format support
- Active development and optimization
Tradeoffs:
- Binary size is significant (~15-30 MB depending on providers)
- CoreML provider on iOS does not always match native CoreML performance
- NNAPI on Android is deprecated in favor of vendor-specific APIs
Best for: Cross-platform apps that need to run the same model on iOS and Android. Vision, audio, and embedding models.
TensorFlow Lite
Google’s mobile inference framework. Mature ecosystem with good model conversion tools and Android-first optimizations.
Strengths:
- Mature model conversion from TensorFlow/Keras
- Good GPU delegate for Android
- Google Edge TPU support on Pixel devices
- Extensive model zoo and documentation
Tradeoffs:
- TensorFlow ecosystem lock-in
- iOS support exists but is not the primary focus
- Less active development than ONNX Runtime
- Cannot run ONNX models directly
Best for: Android-first apps already in the TensorFlow ecosystem.
llama.cpp
C/C++ library for LLM inference. The de facto standard for running language models on consumer hardware, including mobile.
Strengths:
- Best local LLM inference performance
- Metal acceleration on iOS, Vulkan on Android
- Minimal dependencies, small binary
- GGUF format is the standard for quantized LLMs
Tradeoffs:
- LLMs only (not for vision, audio, or embeddings)
- Requires C/C++ FFI from Swift, Kotlin, Dart, etc.
- Memory management on mobile requires careful handling
Best for: Any app that needs to run an LLM on-device.
Platform Summary
| Feature | CoreML | ONNX Runtime | TF Lite | llama.cpp |
|---|---|---|---|---|
| iOS | Native | Via CoreML EP | Supported | Metal |
| Android | No | NNAPI/QNN | Native | Vulkan |
| Model types | Any | Any | Any | LLMs only |
| LLM support | Limited | Limited | Limited | Excellent |
| Neural Engine | Yes | Via CoreML | No | No |
| Binary size | 0 (system) | 15-30 MB | 5-10 MB | 2-5 MB |
Practical recommendation: Use ONNX Runtime or CoreML for non-LLM models (ASR, TTS, embeddings, vision). Use llama.cpp for LLMs. This gives you the best performance for each model type.
What Runs on a Phone Today
Here are models that run well on current mobile hardware, tested on real devices:
Speech Recognition (ASR)
| Model | Size | iPhone 15 Pro | Pixel 8 Pro | Quality |
|---|---|---|---|---|
| Whisper Tiny | 75 MB | ~60ms/sec audio | ~90ms/sec audio | Good for English |
| Whisper Base | 140 MB | ~120ms/sec audio | ~180ms/sec audio | Better multilingual |
| Whisper Small | 460 MB | ~400ms/sec audio | ~600ms/sec audio | Best mobile quality |
Whisper Tiny processes audio faster than real-time on any modern phone. It handles English accurately and is the best choice for real-time transcription on mobile.
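To make "faster than real-time" concrete, the table figures can be converted into a real-time factor (processing time divided by audio duration). The sketch below uses the iPhone 15 Pro numbers from the table; the function name is an illustrative choice.

```python
def real_time_factor(ms_per_audio_second: float) -> float:
    """Processing time per second of audio, as a fraction of real time.

    Values below 1.0 mean the model keeps up with live audio.
    """
    return ms_per_audio_second / 1000.0


# iPhone 15 Pro figures from the table above (ms per second of audio)
iphone_15_pro = {"Whisper Tiny": 60, "Whisper Base": 120, "Whisper Small": 400}

for name, ms in iphone_15_pro.items():
    rtf = real_time_factor(ms)
    print(f"{name}: RTF {rtf:.2f}, about {1 / rtf:.1f}x faster than real time")
```

A real-time factor well below 1.0 leaves headroom for audio decoding, other app work, and slower devices, which is why the smallest model is the safest pick for live transcription.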
Text-to-Speech (TTS)
| Model | Size | iPhone 15 Pro | Pixel 8 Pro | Quality |
|---|---|---|---|---|
| Kokoro 82M | 180 MB | ~0.8s per sentence | ~1.2s per sentence | Natural, expressive |
| Piper (onnx) | 60-100 MB | ~0.3s per sentence | ~0.5s per sentence | Fast, slightly robotic |
Kokoro produces remarkably natural speech for its size. The synthesis delay of roughly a second per sentence is acceptable for most use cases: users tolerate a brief pause before speech begins.
Language Models (LLM)
| Model | Size (Q4) | iPhone 15 Pro | Pixel 8 Pro | Quality |
|---|---|---|---|---|
| Qwen 3.5 0.8B | ~500 MB | ~20 tok/s | ~12 tok/s | Good for simple tasks |
| Llama 3.2 1B | ~700 MB | ~18 tok/s | ~10 tok/s | Better reasoning |
| Llama 3.2 3B | ~1.8 GB | ~8 tok/s | ~5 tok/s | Desktop-like quality |
The 1B models are the sweet spot for mobile: fast enough for conversational use, small enough to download over cellular, and capable enough for summarization, Q&A, and simple chat.
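Whether a token rate is "fast enough for conversational use" depends on reply length. As a rough worked example, the sketch below turns the table's approximate decode rates into time-to-complete for a reply; the 150-token reply length is an assumption chosen for illustration, and prompt processing time is ignored.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


REPLY_TOKENS = 150  # assumed length of a short chat reply

# Approximate iPhone 15 Pro decode rates from the table above
rates = {"Llama 3.2 1B": 18, "Llama 3.2 3B": 8}

for model, tok_s in rates.items():
    secs = generation_seconds(REPLY_TOKENS, tok_s)
    print(f"{model}: {secs:.1f} s for {REPLY_TOKENS} tokens")
```

At about 18 tok/s the reply finishes in roughly 8 seconds, versus roughly 19 seconds at 8 tok/s. Streaming tokens to the UI as they arrive hides much of that wait.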
Embeddings
| Model | Size | iPhone 15 Pro | Pixel 8 Pro | Use Case |
|---|---|---|---|---|
| MiniLM-L6 | 80 MB | ~5ms per query | ~8ms per query | Semantic search |
| all-MiniLM-L6 | 80 MB | ~5ms per query | ~8ms per query | Similarity matching |
Embedding models are small, fast, and enable powerful features: on-device semantic search, document clustering, and RAG without sending documents to the cloud.
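On-device semantic search reduces to comparing a query vector against stored document vectors. The following is a minimal sketch of that ranking step using cosine similarity. The three-dimensional vectors are toy stand-ins for real embeddings, and the function names are illustrative; in an app the vectors would come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy vectors standing in for real embeddings
docs = {
    "note-a": [1.0, 0.0, 0.0],
    "note-b": [0.0, 1.0, 0.0],
    "note-c": [0.9, 0.1, 0.0],
}
print(top_k([1.0, 0.0, 0.0], docs, k=2))
```

A linear scan like this is usually adequate for a few thousand documents on a phone; larger collections may need an approximate nearest-neighbor index.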
Integration Patterns
Pattern 1: Download on First Launch
The most common approach. The app ships without models. On first launch, it downloads the required model(s) from a CDN.
Pros: Small initial app size. Models can be updated independently of the app.
Cons: First-launch experience requires download and wait time. Needs network on first use.
Implementation notes:
- Show a progress indicator with estimated time
- Download in the background if possible
- Cache the model in the app’s documents directory (persists across updates)
- Verify integrity with a SHA-256 checksum after download
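The integrity check in the last note can be sketched as follows. The logic is language-agnostic and is shown in Python for brevity; the function names are illustrative, and the expected digest is assumed to be published alongside the model (for example in a manifest the app already trusts).

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a large model never sits fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_sha256):
    """True if the downloaded file matches the published checksum."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

If verification fails, deleting the file and retrying the download is generally safer than attempting to load a truncated or corrupted model.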
Pattern 2: Bundle in App
Ship the model file inside the app binary. The app works immediately with no download step.
Pros: Works offline from first launch. No download management code.
Cons: Increases app binary size significantly. App Store size limits apply (iOS: 4 GB, but >200 MB triggers a warning; Google Play: 150 MB AAB + 2 GB expansion files).
Best for: Apps where the model is small (under 50 MB) or where offline-first is a hard requirement.
Pattern 3: Lazy Loading With Fallback
Load models on demand when the user first accesses the feature. If the model is not available, show a download prompt or fall back to a cloud API.
Pros: Minimizes initial footprint. Only downloads what the user actually uses.
Cons: Feature is not available until model is downloaded. Requires cloud fallback or graceful degradation.
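The control flow of this pattern can be sketched as a small wrapper. This is an illustration of the decision order only (local model if present, otherwise cloud, otherwise a clear error the UI can turn into a download prompt). `LazyModel`, `ModelUnavailable`, `load_fn`, and `cloud_fn` are hypothetical names, not part of any framework.

```python
from pathlib import Path
from typing import Any, Callable, Optional


class ModelUnavailable(Exception):
    """Raised when the model is not on disk and no fallback is configured."""


class LazyModel:
    def __init__(self, model_path: Path,
                 load_fn: Callable[[Path], Callable[[Any], Any]],
                 cloud_fn: Optional[Callable[[Any], Any]] = None) -> None:
        self.model_path = Path(model_path)
        self._load_fn = load_fn      # hypothetical: loads weights, returns a callable
        self._cloud_fn = cloud_fn    # hypothetical: calls a cloud API instead
        self._model: Optional[Callable[[Any], Any]] = None

    def run(self, request: Any) -> Any:
        # Load on first use, and only if the file has been downloaded.
        if self._model is None and self.model_path.exists():
            self._model = self._load_fn(self.model_path)
        if self._model is not None:
            return self._model(request)
        if self._cloud_fn is not None:
            return self._cloud_fn(request)
        raise ModelUnavailable("model not downloaded; prompt the user to download")

    def unload(self) -> None:
        """Drop the in-memory model, e.g. in response to a memory warning."""
        self._model = None
```

Catching `ModelUnavailable` at the UI layer is one way to show the download prompt without scattering file checks through feature code.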
Memory Management on Mobile
Mobile devices have strict memory limits. iOS will kill your app if it uses too much memory. Android has similar constraints.
Rules of thumb:
- Keep total model memory under 50% of available RAM
- Unload models when not actively in use
- Use lazy loading — do not load all models at app startup
- Monitor memory warnings (iOS didReceiveMemoryWarning, Android onTrimMemory)
- Prefer quantized models (Q4) to reduce memory footprint
For LLMs, the context window length directly impacts memory usage. A 1B model with a 2K context uses significantly less memory than the same model with an 8K context. Cap the context length to what the use case actually requires.
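The effect of context length can be estimated from the size of the key/value cache, which grows linearly with the number of tokens. The sketch below is a back-of-the-envelope estimate, not a measurement: the layer count, KV head count, and head dimension are assumed values for a Llama-3.2-1B-like architecture, the cache is assumed to be stored in FP16, and runtime overhead is ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_value=2):
    """Keys + values for every layer and token (2 bytes each for FP16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def within_budget(model_bytes, cache_bytes, available_ram_bytes, fraction=0.5):
    """Apply the 50%-of-available-RAM rule of thumb from the list above."""
    return model_bytes + cache_bytes <= fraction * available_ram_bytes


# Assumed shape for a ~1B-parameter model (illustrative, not measured)
shape = dict(n_layers=16, n_kv_heads=8, head_dim=64)

for ctx in (2048, 8192):
    mib = kv_cache_bytes(context_len=ctx, **shape) / 2**20
    print(f"context {ctx}: ~{mib:.0f} MiB of KV cache")
```

Under these assumptions the cache grows from about 64 MiB at 2K tokens to about 256 MiB at 8K, which is a meaningful share of the budget next to a ~700 MB model on a phone with 6-8 GB of RAM.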
Performance Optimization
Use Hardware Acceleration
- iOS: CoreML with Neural Engine for vision/audio models. Metal for LLMs via llama.cpp.
- Android: Vendor-specific delegates for vision/audio, with NNAPI only as a legacy option now that it is deprecated. Vulkan for LLMs via llama.cpp.
Hardware acceleration can provide 2-10x speedups over CPU-only inference, depending on the model architecture.
Profile Before Optimizing
Use platform profiling tools to identify bottlenecks:
- iOS: Xcode Instruments (Neural Engine, GPU, CPU profiling)
- Android: Android Studio Profiler, Perfetto for system-level tracing
Common bottleneck: preprocessing (decoding audio, resizing images) can take longer than inference itself. Optimize the full pipeline, not just the model execution.
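Before reaching for platform profilers, a quick way to confirm where time goes is to time each pipeline stage separately. The sketch below shows the idea with a small timing helper; `decode_audio` and `run_inference` are placeholder functions that just sleep, standing in for real preprocessing and model execution.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Accumulate wall-clock seconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


def slowest_stage(timings):
    """Name of the stage with the largest accumulated time."""
    return max(timings, key=timings.get)


# Placeholder stages (hypothetical): replace with real pipeline calls
def decode_audio():
    time.sleep(0.02)


def run_inference():
    time.sleep(0.01)


timings = {}
with timed("preprocess", timings):
    decode_audio()
with timed("inference", timings):
    run_inference()

print("slowest:", slowest_stage(timings))
print({stage: f"{secs * 1000:.1f} ms" for stage, secs in timings.items()})
```

Per-stage numbers like these indicate whether the platform profilers should be pointed at the model or at the code around it.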
Quantize Aggressively
Q4_K_M quantization reduces model size by roughly 3-4x compared to FP16 with minimal quality loss. For mobile, default to quantized models: the quality difference is negligible for most tasks, and the memory and speed improvements are substantial.
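The size reduction follows directly from bits per weight. The sketch below works through the arithmetic for a model of about 1.2 billion parameters. Both the parameter count and the 4.85 bits-per-weight average for Q4_K_M are assumptions for illustration; the real average depends on how the format mixes block types for a given model, and file metadata is ignored.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 1.2e9      # assumed parameter count for a "1B" model
FP16_BITS = 16
Q4_K_M_BITS = 4.85    # assumed average; varies by model

fp16 = model_size_gb(N_PARAMS, FP16_BITS)
q4 = model_size_gb(N_PARAMS, Q4_K_M_BITS)
print(f"FP16:   {fp16:.2f} GB")
print(f"Q4_K_M: {q4:.2f} GB")
print(f"ratio:  {fp16 / q4:.1f}x smaller")
```

With these assumed figures the quantized file lands near 0.7 GB, in line with the ~700 MB listed for Llama 3.2 1B earlier, and the ratio comes out a little over 3x, somewhat below the nominal 4x of a pure 4-bit format.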
Getting Started
Pick your model. Start with one model for one feature. Speech recognition (Whisper Tiny), TTS (Kokoro), embeddings (MiniLM), or chat (Llama 3.2 1B) are all proven on mobile.
Pick your runtime. ONNX Runtime for non-LLM models, llama.cpp for LLMs. Both work on iOS and Android.
Prototype on-device. Get the model running on a physical device (not a simulator). Measure actual latency and memory usage.
Handle the packaging. Decide between download-on-first-launch and bundled. For models over 50 MB, download-on-first-launch is usually the better UX.
Test on lower-end devices. Your model runs fast on an iPhone 15 Pro. Does it run acceptably on an iPhone 12? On a Pixel 6a? Define your minimum supported hardware and benchmark there.
The hardware is ready. The models are small enough. The frameworks are mature enough. On-device AI on mobile is not a future possibility — it is a present capability waiting for developers to use it.