← Back to blog Engineering

How to Run LLMs Locally: A Complete Guide (2026)

How to run large language models locally on your laptop, desktop, or phone — llama.cpp, Ollama, ONNX Runtime, and on-device options compared. No cloud API needed.

Glenn Sonna

· June 9, 2026 · 12 min read

local-llmrun-llm-locallyoffline-aiprivate-llmon-device-ailocal-ai

You do not need a cloud API key to use a large language model. Models like Llama 3.2, Qwen 3.5, Phi-3, and Gemma 2 run on consumer hardware — your laptop, your desktop, even your phone. No internet required, no per-token billing, no data leaving your machine.

This guide covers everything you need to start running LLMs locally: which models fit on which hardware, the major runtimes and tools available, how to choose between them, and what performance to expect.

Why Run LLMs Locally

Four reasons drive most people to local LLM inference:

Privacy. Your prompts and conversations stay on your machine. No data is transmitted to OpenAI, Anthropic, Google, or any other provider. For personal use, this means your journal entries, medical questions, and private thoughts never leave your device. For enterprise use, it means proprietary code, internal documents, and customer data stay within your infrastructure.

Cost. Cloud LLM APIs charge per token. At scale, this adds up quickly. Local inference has a one-time cost (downloading the model) and zero marginal cost per token. If you run hundreds of queries per day, local inference pays for itself within the first week.

Latency. Local inference eliminates network round-trips. For applications that need real-time responses — voice assistants, code completion, interactive chat — removing 100-300ms of network overhead makes a noticeable difference.

Control. You choose the model, the quantization level, the system prompt, and the inference parameters. No rate limits, no content filtering you did not ask for, no surprise model deprecations. The model runs when you tell it to, exactly how you configured it.

Hardware Requirements

The constraining factor for local LLMs is memory. A language model must fit entirely in RAM (or VRAM) during inference. Here is what you need:

Desktop and Laptop

Model Size	RAM Required (Q4 quantized)	Example Models	Hardware
0.5-1B params	1-2 GB	Qwen 3.5 0.8B, SmolLM2	Any modern laptop
1-3B params	2-3 GB	Llama 3.2 1B, Llama 3.2 3B, Phi-3 Mini	Any laptop with 8 GB RAM
7-8B params	5-6 GB	Llama 3.1 8B, Mistral 7B, Gemma 2 9B	Laptop with 16 GB RAM
13B params	8-10 GB	Llama 2 13B, CodeLlama 13B	Laptop with 16+ GB RAM
30-34B params	20-24 GB	CodeLlama 34B, Yi 34B	Desktop with 32 GB RAM or GPU with 24 GB VRAM
70B params	40-48 GB	Llama 3.1 70B	High-end workstation or multi-GPU setup

Apple Silicon (M1/M2/M3/M4) is particularly good for local LLMs because the unified memory architecture means CPU and GPU share the same RAM pool. A MacBook with 16 GB of unified memory can run 7-8B models fluently in the GPU.

NVIDIA GPUs with VRAM offload the entire model to the GPU for maximum speed. An RTX 4090 (24 GB VRAM) handles up to 13B models comfortably at full GPU speed, or 30B+ models with partial CPU offload.

CPU-only works for models up to 13B with acceptable speed (5-15 tokens/second). Beyond that, an integrated or discrete GPU is strongly recommended.

Mobile Devices

Device Tier	Available Memory	Max Model Size	Example Models
Flagship 2024+ (iPhone 15 Pro, Pixel 8 Pro)	6-8 GB total (~3-4 GB for models)	1-3B Q4	Llama 3.2 1B, Qwen 3.5 0.8B
Mid-range 2024+	4-6 GB total (~2-3 GB for models)	0.5-1B Q4	SmolLM2, Qwen 3.5 0.8B
Older / budget phones	Under 4 GB total	Not recommended	—

Mobile inference is real but constrained. A 1B parameter model quantized to Q4 generates 10-20 tokens per second on a flagship phone — usable for chat, summarization, and simple question-answering. Models larger than 3B are impractical on current mobile hardware.

Model Formats

Local LLM models come in several formats. The format determines which runtimes can load the model and what optimizations are available.

GGUF

The dominant format for local LLM inference. GGUF files are self-contained — the model weights, tokenizer, and metadata are in a single file. llama.cpp, Ollama, LM Studio, and most local LLM tools use GGUF.

GGUF supports multiple quantization levels:

Quantization	Bits Per Weight	Quality	Speed	Use Case
Q2_K	~2.5	Low	Fastest	Experimentation only
Q4_K_M	~4.5	Good	Fast	Recommended default
Q5_K_M	~5.5	Very good	Moderate	When quality matters more than speed
Q6_K	~6.5	Near-FP16	Slower	Quality-sensitive tasks
Q8_0	8	Excellent	Slow	When you have the memory
FP16	16	Full	Slowest	Benchmarking only

Start with Q4_K_M. It offers the best balance of quality, speed, and memory usage for most models and most use cases.

ONNX

Used by ONNX Runtime for cross-platform inference. ONNX models can leverage platform-specific accelerators (CoreML on Apple, DirectML on Windows, NNAPI on Android). Less common for LLMs but well-suited for smaller models and non-LLM tasks (ASR, TTS, embeddings, classification).

SafeTensors

The native format for Hugging Face models. Used by transformers, Candle (Rust), and other Python-ecosystem tools. Not directly loadable by llama.cpp or Ollama — you need to convert to GGUF first.

Runtime Options Compared

Several tools exist for running LLMs locally. They serve different use cases.

llama.cpp

The foundational project for local LLM inference. A C/C++ library with no dependencies that runs on CPU, CUDA, Metal, Vulkan, and SYCL.

Best for: Developers who need maximum control, custom integrations, embedding in other applications, mobile deployment.

Strengths:

Fastest CPU inference of any runtime
Metal acceleration on Apple Silicon
CUDA acceleration on NVIDIA GPUs
Used as the backend by Ollama, LM Studio, and most other tools
C API for embedding in any language

Tradeoffs:

No GUI — command-line and API only
Compiling from source for optimal performance
Configuration requires understanding quantization and inference parameters

Ollama

A user-friendly wrapper around llama.cpp. Install, ollama pull llama3.2, ollama run llama3.2. That is it.

Best for: Getting started quickly, running models as a local API server, replacing cloud API calls with a local endpoint.

Strengths:

Simplest installation and model management
OpenAI-compatible API server (drop-in replacement for many applications)
Model library with one-command downloads
Good defaults for most parameters

Tradeoffs:

Less control over inference parameters than raw llama.cpp
Abstracts away model format details (harder to use custom GGUF files)
Desktop only — no mobile support

LM Studio

A desktop application with a GUI for browsing, downloading, and chatting with local models.

Best for: Non-developers, experimentation, evaluating different models quickly.

Strengths:

Visual interface for model management
Built-in chat UI
Local API server (OpenAI-compatible)
Model discovery and download from HuggingFace

Tradeoffs:

Desktop only (macOS, Windows, Linux)
Not embeddable in other applications
Closed source

Embedding in Mobile and Cross-Platform Apps

For shipping local LLMs inside an application (mobile, desktop, or embedded), you need a runtime that can be linked as a library, not run as a separate process.

This is where llama.cpp’s C API becomes essential. It can be compiled for iOS, Android, macOS, Windows, and Linux, then called from your application code through FFI bindings. The alternative is ONNX Runtime, which provides similar cross-platform coverage but is more commonly used for non-LLM models.

The challenge is packaging: you need to bundle the model with your app or download it on first launch, manage model storage on constrained devices, and handle memory pressure gracefully. This is non-trivial engineering that most developers should not rebuild from scratch.

Choosing a Model

The “best” local LLM depends on your hardware and use case. Here are practical recommendations:

For Chat and General Assistance

Hardware	Recommended Model	Why
8 GB laptop	Llama 3.2 3B Q4_K_M	Best quality at this memory budget
16 GB laptop	Llama 3.1 8B Q4_K_M	Best general-purpose local model
32 GB desktop	Llama 3.1 70B Q4_K_M	Cloud-competitive quality
Mobile (flagship)	Llama 3.2 1B Q4_K_M	Best mobile chat model

For Code

Hardware	Recommended Model	Why
16 GB laptop	CodeLlama 7B Q4_K_M or DeepSeek Coder 6.7B	Trained specifically for code generation
32 GB desktop	CodeLlama 34B Q4_K_M	Closest to cloud code assistants

For Summarization and Extraction

Smaller models (1-3B) handle summarization and information extraction well because the task is largely extractive rather than generative. A Llama 3.2 1B can summarize a document nearly as well as a 70B model for straightforward summarization tasks.

Performance Expectations

Realistic token generation speeds on common hardware:

Model	Hardware	Quantization	Tokens/sec	Experience
Llama 3.2 1B	iPhone 15 Pro	Q4_K_M	15-25	Usable chat
Llama 3.2 3B	M2 MacBook Air	Q4_K_M	30-50	Fluent
Llama 3.1 8B	M3 MacBook Pro	Q4_K_M	25-40	Fluent
Llama 3.1 8B	RTX 4090	Q4_K_M	80-120	Instant
Llama 3.1 70B	M3 Max (48 GB)	Q4_K_M	8-15	Readable
Llama 3.1 70B	RTX 4090 (24 GB)	Q4_K_M	15-25 (partial offload)	Usable

For conversational use, 10+ tokens/second feels responsive. Anything above 20 tokens/second feels fluent. Below 5 tokens/second, users notice the delay and it starts to feel sluggish.

Getting Started (5 Minutes)

The fastest path from zero to running a local LLM:

Option 1: Ollama (Simplest)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama3.2

# Or run as an API server
ollama serve
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Hello"}'

Option 2: llama.cpp (Most Control)

# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j

# Download a model (GGUF format from HuggingFace)
# Then run:
./llama-cli -m path/to/model.gguf -p "Hello, how are you?" -n 256

Option 3: In a Mobile or Desktop App

For embedding a local LLM in your own application, you need an SDK that handles model management, platform-specific compilation, and runtime abstraction. This is more involved than running a CLI tool, but it gives you full control over the user experience.

The general pattern:

Bundle or download the GGUF model on first launch
Load the model into memory using llama.cpp (or an abstraction over it)
Run inference through the library’s API
Stream tokens to your UI as they are generated

Limitations to Know

Context length is memory-bound. A longer context window requires more memory. A 7B model with a 4K context fits in 6 GB. The same model with a 32K context needs 10+ GB. If memory is tight, reduce the context length.

First-token latency is real. The first token takes longer than subsequent tokens because the model must process the entire prompt (prefill). For a 2,000-token prompt on a 7B model, expect 1-3 seconds before the first token appears. Subsequent tokens stream at the speeds listed above.

Not all tasks benefit. Tasks that require vast world knowledge, multi-step reasoning chains, or real-time information retrieval still favor larger cloud models. Local LLMs excel at focused tasks: summarization, extraction, code generation, conversational Q&A, and text transformation.

Quantization has limits. Below Q3, most models degrade noticeably. Q4_K_M is the sweet spot. If a model does not perform well at Q4, try the next size up rather than increasing the quantization quality of the same model.

What Is Next for Local LLMs

The trajectory is clear: models are getting smaller and more capable. Each generation of hardware ships with more NPU/Neural Engine compute. The models that required a desktop last year run on a phone this year.

Within the next year, expect:

3B parameter models that match today’s 7B quality
Dedicated NPU inference paths on iOS and Android for LLMs
Better speculative decoding for faster generation on constrained hardware
On-device fine-tuning for personalization

The gap between cloud and local model quality is closing. For the majority of practical AI features — chat, summarization, extraction, code assistance, voice interaction — local inference is already good enough. It will only get better.

On-Device AI: The Complete Guide to Running ML Models Locally — broader guide covering all model types, not just LLMs.
Edge AI vs Cloud AI: When to Run Models On-Device — decision framework for when local inference makes sense vs cloud.

Jun 30, 2026 · 5 min read

Whisper Speech Recognition as a Single Rust Binary

Build a Whisper-powered transcription tool as a single Rust binary. Candle for pure-Rust inference. No runtime dependencies, works airgapped.

whisper-rustoffline-aispeech-recognition

Jun 16, 2026 · 10 min read

Private AI: How to Run AI Models Without Sending Data to the Cloud

How to run AI models privately on-device — no cloud APIs, no data leaving the device. Covers GDPR, HIPAA, and SOC 2 compliance, private LLMs, and practical implementation patterns.

private-aiprivate-llmon-device-ai

Jul 3, 2026 · 6 min read

How We Made ONNX Runtime 6.8x Faster on Apple Silicon with CoreML

Real benchmarks showing when Apple's Neural Engine helps (and when it hurts). Lessons from optimizing ML inference across execution providers.

onnx-runtimecoremlapple-silicon-ml

How to Run LLMs Locally: A Complete Guide (2026)

Why Run LLMs Locally

Hardware Requirements

Desktop and Laptop

Mobile Devices

Model Formats

GGUF

ONNX

SafeTensors

Runtime Options Compared

llama.cpp

Ollama

LM Studio

Embedding in Mobile and Cross-Platform Apps

Choosing a Model

For Chat and General Assistance

For Code

For Summarization and Extraction

Performance Expectations

Getting Started (5 Minutes)

Option 1: Ollama (Simplest)

Option 2: llama.cpp (Most Control)

Option 3: In a Mobile or Desktop App

Limitations to Know

What Is Next for Local LLMs

Related

Related articles

Whisper Speech Recognition as a Single Rust Binary

Private AI: How to Run AI Models Without Sending Data to the Cloud

How We Made ONNX Runtime 6.8x Faster on Apple Silicon with CoreML