On-Device AI for Mobile Apps: A Developer's Guide to iOS and Android

How to add on-device AI to iOS and Android apps — comparing CoreML, ONNX Runtime, TensorFlow Lite, and llama.cpp. Practical guide with model sizes, performance benchmarks, and integration patterns.

Glenn Sonna
12 min read

Adding AI features to a mobile app used to mean integrating a cloud API. Send user data to a server, wait for a response, pay per request. That model works, but it is no longer the only option — and for many features, it is no longer the best one.

Modern smartphones have hardware specifically designed for ML inference: Apple’s Neural Engine, Qualcomm’s Hexagon DSP, and Google’s Tensor TPU. A flagship phone from 2024 can transcribe a second of speech in under 100 ms, generate natural-sounding speech, classify images in real time, and even run a 1B-parameter language model, all without an internet connection.

This guide covers the practical side of adding on-device AI to iOS and Android apps: which frameworks to use, which models fit on mobile hardware, what performance to expect, and how to handle the engineering challenges of bundling and running models on constrained devices.

The Mobile AI Hardware Landscape

Before choosing a framework, understand what hardware is available to you.

iOS (Apple Silicon)

Every iPhone since the iPhone 8 (2017) includes a Neural Engine — dedicated hardware for ML inference. The capability has scaled dramatically:

| Chip | Neural Engine | TOPS | Available Memory |
|---|---|---|---|
| A11 (iPhone 8/X) | 2-core | 0.6 | 2-3 GB |
| A14 (iPhone 12) | 16-core | 11 | 4 GB |
| A15 (iPhone 13/14) | 16-core | 15.8 | 4-6 GB |
| A16 (iPhone 14 Pro/15) | 16-core | 17 | 6 GB |
| A17 Pro (iPhone 15 Pro) | 16-core | 35 | 8 GB |
| A18 Pro (iPhone 16 Pro) | 16-core | 35+ | 8 GB |

The Neural Engine is accessed through CoreML, Apple’s ML framework. Models that CoreML can optimize for the Neural Engine run dramatically faster than CPU-only inference — we have measured 6.8x speedups on vision models like MobileNet.

The GPU (Metal) is also available for ML workloads that do not map well to the Neural Engine, such as models with dynamic shapes or certain attention patterns.

Android

Android hardware is more varied. The key accelerators:

Qualcomm (Snapdragon): The Hexagon DSP and AI Engine provide ML acceleration. Accessed via ONNX Runtime’s QNN provider or Qualcomm’s AI Hub SDK. Performance varies significantly by chipset generation.

Google Tensor: Custom chips in Pixel phones with dedicated TPU cores. Accessed via TensorFlow Lite with the Google Edge TPU delegate.

MediaTek: APU (AI Processing Unit) available in Dimensity chips. Accessed via the MediaTek NeuroPilot SDK or ONNX Runtime.

Samsung (Exynos): NPU available in some Galaxy devices. Accessed via Samsung’s Neural SDK.

The fragmentation is real. An AI feature that runs well on a Pixel 8 (Tensor G3) might be slow on a mid-range Snapdragon. The practical approach is to target CPU inference as the baseline and treat hardware acceleration as an optimization for specific chipsets.

| Chipset | Category | AI Performance | Available Memory |
|---|---|---|---|
| Snapdragon 8 Gen 3 | Flagship | 45 TOPS | 8-16 GB |
| Snapdragon 8 Gen 2 | Flagship | 33 TOPS | 8-12 GB |
| Snapdragon 7 Gen 1 | Mid-range | 13 TOPS | 6-8 GB |
| Tensor G3 (Pixel 8) | Flagship | ~20 TOPS | 12 GB |
| Dimensity 9300 | Flagship | 41 TOPS | 8-16 GB |
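
The CPU-baseline approach for Android can be sketched as a small selection function. This is a minimal illustration, not a real runtime API: the backend labels (`qnn`, `edge_tpu`, `neuropilot`, `samsung_npu`, `gpu`, `cpu`) and the preference table are hypothetical names chosen to mirror the accelerators listed above.

```python
# Hypothetical backend labels; map them to whatever your runtime exposes.
CPU = "cpu"

# Assumed preference order per chipset family, most specialized first.
PREFERRED_BACKENDS = {
    "snapdragon": ["qnn", "gpu"],
    "tensor": ["edge_tpu", "gpu"],
    "dimensity": ["neuropilot", "gpu"],
    "exynos": ["samsung_npu", "gpu"],
}


def pick_backends(chipset_family, available):
    """Return backends to try in order, always ending with the CPU baseline."""
    preferred = PREFERRED_BACKENDS.get(chipset_family.lower(), [])
    # Keep only accelerators this device actually reports as usable.
    chosen = [backend for backend in preferred if backend in available]
    chosen.append(CPU)  # CPU is the guaranteed fallback on every device
    return chosen


print(pick_backends("Snapdragon", {"qnn", "gpu"}))
print(pick_backends("unknown-soc", {"gpu"}))
```

Unknown chipsets deliberately stay on the CPU path here, so an unrecognized device gets the tested baseline rather than an untested accelerator.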

Inference Frameworks Compared

Four frameworks dominate mobile ML inference. Each has different strengths.

CoreML (iOS only)

Apple’s native ML framework. First-class access to the Neural Engine, GPU (Metal), and CPU. Tightly integrated with Xcode and the Apple ecosystem.

Strengths:

  • Best Neural Engine utilization on iOS
  • Optimized for Apple Silicon
  • Supports model conversion from ONNX, TensorFlow, and PyTorch
  • Xcode integration with model preview and profiling

Tradeoffs:

  • iOS and macOS only
  • Limited model format support (must convert to .mlmodel/.mlpackage)
  • Less control over execution providers than ONNX Runtime
  • Dynamic shapes can force CPU fallback (losing Neural Engine benefits)

Best for: iOS-exclusive apps where maximum performance on Apple hardware matters.

ONNX Runtime

Microsoft’s cross-platform inference engine. Runs on iOS (CoreML backend), Android (NNAPI, QNN), and desktop. Supports the standardized ONNX model format.

Strengths:

  • Cross-platform: same model runs on iOS, Android, Windows, macOS, Linux
  • Multiple execution providers (CoreML, CUDA, DirectML, NNAPI, QNN)
  • Broad model format support
  • Active development and optimization

Tradeoffs:

  • Binary size is significant (~15-30 MB depending on providers)
  • CoreML provider on iOS does not always match native CoreML performance
  • NNAPI on Android is deprecated in favor of vendor-specific APIs

Best for: Cross-platform apps that need to run the same model on iOS and Android. Vision, audio, and embedding models.

TensorFlow Lite

Google’s mobile inference framework. Mature ecosystem with good model conversion tools and Android-first optimizations.

Strengths:

  • Mature model conversion from TensorFlow/Keras
  • Good GPU delegate for Android
  • Google Edge TPU support on Pixel devices
  • Extensive model zoo and documentation

Tradeoffs:

  • TensorFlow ecosystem lock-in
  • iOS support exists but is not the primary focus
  • Less active development than ONNX Runtime
  • Cannot run ONNX models directly

Best for: Android-first apps already in the TensorFlow ecosystem.

llama.cpp

C/C++ library for LLM inference. The de facto standard for running language models on consumer hardware, including mobile.

Strengths:

  • Best local LLM inference performance
  • Metal acceleration on iOS, Vulkan on Android
  • Minimal dependencies, small binary
  • GGUF format is the standard for quantized LLMs

Tradeoffs:

  • LLMs only (not for vision, audio, or embeddings)
  • Requires C/C++ FFI from Swift, Kotlin, Dart, etc.
  • Memory management on mobile requires careful handling

Best for: Any app that needs to run an LLM on-device.

Platform Summary

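| Feature | CoreML | ONNX Runtime | TF Lite | llama.cpp |
|---|---|---|---|---|
| iOS | Native | Via CoreML EP | Supported | Metal |
| Android | No | NNAPI/QNN | Native | Vulkan |
| Model types | Any | Any | Any | LLMs only |
| LLM support | Limited | Limited | Limited | Excellent |
| Neural Engine | Yes | Via CoreML | No | No |
| Binary size | 0 (system) | 15-30 MB | 5-10 MB | 2-5 MB |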

Practical recommendation: Use ONNX Runtime or CoreML for non-LLM models (ASR, TTS, embeddings, vision). Use llama.cpp for LLMs. This gives you the best performance for each model type.

What Runs on a Phone Today

Here are models that run well on current mobile hardware, tested on real devices:

Speech Recognition (ASR)

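| Model | Size | iPhone 15 Pro | Pixel 8 Pro | Quality |
|---|---|---|---|---|
| Whisper Tiny | 75 MB | ~60 ms per sec of audio | ~90 ms per sec of audio | Good for English |
| Whisper Base | 140 MB | ~120 ms per sec of audio | ~180 ms per sec of audio | Better multilingual |
| Whisper Small | 460 MB | ~400 ms per sec of audio | ~600 ms per sec of audio | Best mobile quality |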

Whisper Tiny processes audio faster than real-time on any modern phone. It handles English accurately and is the best choice for real-time transcription on mobile.
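
To make "faster than real time" concrete, the sketch below converts the per-second processing times into a real-time factor (processing time divided by audio duration). The helper names are illustrative, and the figures are the approximate iPhone 15 Pro numbers from the table above.

```python
def real_time_factor(ms_per_audio_second):
    """Processing time divided by audio duration; below 1.0 beats real time."""
    return ms_per_audio_second / 1000.0


def transcription_seconds(audio_seconds, ms_per_audio_second):
    """Estimated wall-clock time to transcribe a clip of the given length."""
    return audio_seconds * ms_per_audio_second / 1000.0


# Approximate iPhone 15 Pro figures from the ASR table (ms per second of audio).
IPHONE_15_PRO_MS = {"Whisper Tiny": 60, "Whisper Base": 120, "Whisper Small": 400}

for name, ms in IPHONE_15_PRO_MS.items():
    rtf = real_time_factor(ms)
    clip = transcription_seconds(30, ms)
    print(f"{name}: RTF {rtf:.2f}, 30 s clip in about {clip:.1f} s")
```

Any model with a real-time factor below 1.0 can keep up with live audio; the lower the factor, the more headroom remains for preprocessing and UI work.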

Text-to-Speech (TTS)

| Model | Size | iPhone 15 Pro | Pixel 8 Pro | Quality |
|---|---|---|---|---|
| Kokoro 82M | 180 MB | ~0.8 s per sentence | ~1.2 s per sentence | Natural, expressive |
| Piper (onnx) | 60-100 MB | ~0.3 s per sentence | ~0.5 s per sentence | Fast, slightly robotic |

Kokoro produces remarkably natural speech for its size. The synthesis delay of roughly a second per sentence is acceptable for most use cases, since a brief pause before speech begins rarely bothers users.

Language Models (LLM)

| Model | Size (Q4) | iPhone 15 Pro | Pixel 8 Pro | Quality |
|---|---|---|---|---|
| Qwen 3.5 0.8B | ~500 MB | ~20 tok/s | ~12 tok/s | Good for simple tasks |
| Llama 3.2 1B | ~700 MB | ~18 tok/s | ~10 tok/s | Better reasoning |
| Llama 3.2 3B | ~1.8 GB | ~8 tok/s | ~5 tok/s | Desktop-like quality |

The 1B models are the sweet spot for mobile: fast enough for conversational use, small enough to download over cellular, and capable enough for summarization, Q&A, and simple chat.
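
A quick back-of-the-envelope calculation shows what those throughput and size numbers mean for users. The reply length (150 tokens) and the network speed (50 Mbit/s) are assumptions picked for illustration; the throughput and size values come from the table above.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate a reply of num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


def download_seconds(size_mb, megabits_per_second):
    """Time to download size_mb megabytes at the given link speed."""
    return size_mb * 8 / megabits_per_second


REPLY_TOKENS = 150   # assumed length of a short chat reply
LINK_MBPS = 50       # assumed cellular throughput

# Llama 3.2 1B (Q4): ~700 MB, ~18 tok/s on iPhone 15 Pro, ~10 tok/s on Pixel 8 Pro.
print(f"iPhone 15 Pro reply: {generation_seconds(REPLY_TOKENS, 18):.1f} s")
print(f"Pixel 8 Pro reply:   {generation_seconds(REPLY_TOKENS, 10):.1f} s")
print(f"Model download:      {download_seconds(700, LINK_MBPS):.0f} s")
```

Because tokens can be streamed to the UI as they are produced, the perceived latency is usually closer to the time to first token than to the full generation time.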

Embeddings

| Model | Size | iPhone 15 Pro | Pixel 8 Pro | Use Case |
|---|---|---|---|---|
| MiniLM-L6 | 80 MB | ~5 ms per query | ~8 ms per query | Semantic search |
| all-MiniLM-L6 | 80 MB | ~5 ms per query | ~8 ms per query | Similarity matching |

Embedding models are small, fast, and enable powerful features: on-device semantic search, document clustering, and RAG without sending documents to the cloud.
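
On-device semantic search reduces to ranking stored vectors by cosine similarity against the query vector. The sketch below uses hand-written three-dimensional toy vectors and hypothetical document names purely for illustration; a real embedding model would produce much longer vectors, and those would come from the model rather than being typed in.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy index: hypothetical documents with made-up 3-dimensional embeddings.
INDEX = {
    "invoice.pdf": [0.9, 0.1, 0.0],
    "recipe.txt": [0.0, 0.2, 0.9],
    "receipt.png": [0.8, 0.3, 0.1],
}

for doc_id, score in top_k([1.0, 0.0, 0.0], INDEX, k=2):
    print(f"{doc_id}: {score:.3f}")
```

A linear scan like this is usually adequate for a few thousand documents on a phone; larger collections may warrant an approximate nearest-neighbor index.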

Integration Patterns

Pattern 1: Download on First Launch

The most common approach. The app ships without models. On first launch, it downloads the required model(s) from a CDN.

Pros: Small initial app size. Models can be updated independently of the app.

Cons: First-launch experience requires download and wait time. Needs network on first use.

Implementation notes:

  • Show a progress indicator with estimated time
  • Download in the background if possible
  • Cache the model in the app’s documents directory (persists across updates)
  • Verify integrity with a SHA-256 checksum after download
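
The integrity check in the last note can be sketched with a streaming hash, so a large model file never has to be held in memory at once. The function names and the idea of shipping the expected digest alongside the download URL are assumptions for illustration.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in 1 MB chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_hex):
    """Return True if the file's SHA-256 matches the expected hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

If verification fails, deleting the file and retrying the download is typically safer than attempting to load a partially written model.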

Pattern 2: Bundle in App

Ship the model file inside the app binary. The app works immediately with no download step.

Pros: Works offline from first launch. No download management code.

Cons: Increases app binary size significantly. App Store size limits apply: on iOS the limit is 4 GB, but anything over 200 MB triggers a warning; on Google Play it is 150 MB for the AAB plus 2 GB of expansion files.

Best for: Apps where the model is small (under 50 MB) or where offline-first is a hard requirement.
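
The choice between bundling and downloading can be written down as a small rule. This is only a sketch of the guidance above; the function name, the return labels, and treating 50 MB as a hard cutoff are simplifying assumptions.

```python
BUNDLE_LIMIT_MB = 50  # threshold suggested in the text; tune for your app


def choose_packaging(model_size_mb, offline_required=False):
    """Pick a packaging strategy from model size and offline requirements."""
    if offline_required or model_size_mb < BUNDLE_LIMIT_MB:
        return "bundle"
    return "download_on_first_launch"


print(choose_packaging(40))                           # small model
print(choose_packaging(700))                          # 1B LLM at Q4
print(choose_packaging(700, offline_required=True))   # offline-first app
```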

Pattern 3: Lazy Loading With Fallback

Load models on demand when the user first accesses the feature. If the model is not available, show a download prompt or fall back to a cloud API.

Pros: Minimizes initial footprint. Only downloads what the user actually uses.

Cons: Feature is not available until model is downloaded. Requires cloud fallback or graceful degradation.
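
One way to structure this pattern is a thin wrapper that defers loading until the first call and routes to a fallback when no local model exists. `LazyModel`, `load_local`, and `cloud_fallback` are hypothetical names; in a real app the loader would wrap your inference runtime and the fallback would wrap your API client.

```python
class LazyModel:
    """Load a local model on first use; fall back to the cloud if it is missing."""

    def __init__(self, load_local, cloud_fallback=None):
        # load_local: returns a callable model, or None if not downloaded yet.
        self._load_local = load_local
        # cloud_fallback: optional callable used while no local model exists.
        self._cloud_fallback = cloud_fallback
        self._model = None

    def run(self, inputs):
        if self._model is None:
            self._model = self._load_local()  # retried until a model appears
        if self._model is not None:
            return self._model(inputs)
        if self._cloud_fallback is not None:
            return self._cloud_fallback(inputs)
        raise RuntimeError("model not downloaded and no fallback configured")

    def unload(self):
        """Drop the model so its memory can be reclaimed."""
        self._model = None
```

The `RuntimeError` branch is where an app would typically surface a download prompt instead of failing silently.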

Memory Management on Mobile

Mobile devices have strict memory limits. iOS will kill your app if it uses too much memory. Android has similar constraints.

Rules of thumb:

  • Keep total model memory under 50% of available RAM
  • Unload models when not actively in use
  • Use lazy loading — do not load all models at app startup
  • Monitor memory warnings (iOS didReceiveMemoryWarning, Android onTrimMemory)
  • Prefer quantized models (Q4) to reduce memory footprint
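
The 50% rule is easy to check before committing to a set of models. The sketch below uses on-disk model sizes as a stand-in for runtime memory, which is an assumption: actual usage is higher once activations and caches are counted, so treat the result as an optimistic bound.

```python
def memory_budget_mb(device_ram_gb, budget_fraction=0.5):
    """Memory available for models under the rule of thumb, in MB."""
    return device_ram_gb * 1024 * budget_fraction


def fits_memory_budget(model_sizes_mb, device_ram_gb, budget_fraction=0.5):
    """True if the models' combined size stays within the budget."""
    return sum(model_sizes_mb) <= memory_budget_mb(device_ram_gb, budget_fraction)


# Sizes from the tables above: Whisper Tiny, Kokoro, Llama 3.2 1B (Q4).
voice_assistant = [75, 180, 700]
# Whisper Small, Kokoro, Llama 3.2 3B (Q4).
heavy_stack = [460, 180, 1800]

print(fits_memory_budget(voice_assistant, device_ram_gb=4))
print(fits_memory_budget(heavy_stack, device_ram_gb=4))
print(fits_memory_budget(heavy_stack, device_ram_gb=8))
```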

For LLMs, the context window length directly impacts memory usage. A 1B model with a 2K context uses significantly less memory than the same model with an 8K context. Cap the context length to what the use case actually requires.
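
The context-length effect comes mostly from the key-value cache, which grows linearly with the number of tokens kept in context. The estimate below assumes a grouped-query-attention model with 16 layers, 8 KV heads, a head dimension of 64, and FP16 cache entries; these shape values are assumptions for a roughly 1B-parameter model, so substitute the numbers from your model's config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


MIB = 1024 * 1024

# Assumed shape for a ~1B model (check your model's actual config).
shape = dict(n_layers=16, n_kv_heads=8, head_dim=64)

for ctx in (2048, 8192):
    size_mib = kv_cache_bytes(context_len=ctx, **shape) / MIB
    print(f"{ctx} tokens: {size_mib:.0f} MiB of KV cache")
```

Under these assumptions, quadrupling the context quadruples the cache, which is memory paid on top of the model weights.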

Performance Optimization

Use Hardware Acceleration

  • iOS: CoreML with Neural Engine for vision/audio models. Metal for LLMs via llama.cpp.
  • Android: NNAPI or vendor-specific delegates for vision/audio. Vulkan for LLMs via llama.cpp.

Hardware acceleration can provide 2-10x speedups over CPU-only inference, depending on the model architecture.

Profile Before Optimizing

Use platform profiling tools to identify bottlenecks:

  • iOS: Xcode Instruments (Neural Engine, GPU, CPU profiling)
  • Android: Android Studio Profiler, Perfetto for system-level tracing

Common bottleneck: preprocessing (decoding audio, resizing images) can take longer than inference itself. Optimize the full pipeline, not just the model execution.
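
Before reaching for platform profilers, a coarse per-stage timer is often enough to see whether preprocessing or inference dominates. The sketch below times an arbitrary list of stages; the stage names and the stand-in functions are placeholders for your own decode, resize, and inference calls.

```python
import time


def time_stages(stages, payload):
    """Run (name, fn) stages in order; return the output and per-stage ms."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        payload = fn(payload)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return payload, timings


def slowest_stage(timings):
    """Name of the stage that took the longest."""
    return max(timings, key=timings.get)


# Placeholder stages standing in for real preprocessing and inference.
pipeline = [
    ("preprocess", lambda samples: [s / 32768.0 for s in samples]),
    ("inference", lambda features: sum(features)),
]

_, timings = time_stages(pipeline, list(range(100_000)))
print(timings, "slowest:", slowest_stage(timings))
```

Run the measurement several times on a physical device and discard the first run, since cold caches and lazy initialization tend to inflate it.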

Quantize Aggressively

Q4_K_M quantization reduces model size by roughly 3-4x compared to FP16 with minimal quality loss. For mobile, default to quantized models: the quality difference is negligible for most tasks, and the memory and speed improvements are substantial.
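
The size savings follow directly from bits per weight. The sketch below compares FP16 with a nominal 4-bit format and with an assumed effective rate of 4.5 bits per weight, since 4-bit formats also store scaling metadata; the 4.5 figure is an illustrative assumption, not a measured value for any specific quantization scheme.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def compression_ratio(bits_before, bits_after):
    """How many times smaller the weights get after quantization."""
    return bits_before / bits_after


N_PARAMS = 1e9  # a 1B-parameter model

print(f"FP16:          {model_size_gb(N_PARAMS, 16):.2f} GB")
print(f"Nominal 4-bit: {model_size_gb(N_PARAMS, 4):.2f} GB "
      f"({compression_ratio(16, 4):.1f}x smaller)")
print(f"Assumed 4.5:   {model_size_gb(N_PARAMS, 4.5):.2f} GB "
      f"({compression_ratio(16, 4.5):.1f}x smaller)")
```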

Getting Started

  1. Pick your model. Start with one model for one feature. Speech recognition (Whisper Tiny), TTS (Kokoro), embeddings (MiniLM), or chat (Llama 3.2 1B) are all proven on mobile.

  2. Pick your runtime. ONNX Runtime for non-LLM models, llama.cpp for LLMs. Both work on iOS and Android.

  3. Prototype on-device. Get the model running on a physical device (not a simulator). Measure actual latency and memory usage.

  4. Handle the packaging. Decide between download-on-first-launch and bundled. For models over 50 MB, download-on-first-launch is usually the better UX.

  5. Test on lower-end devices. Your model runs fast on an iPhone 15 Pro. Does it run acceptably on an iPhone 12? On a Pixel 6a? Define your minimum supported hardware and benchmark there.

The hardware is ready. The models are small enough. The frameworks are mature enough. On-device AI on mobile is not a future possibility — it is a present capability waiting for developers to use it.

