← Back to blog Engineering

How to Run LLMs Locally: A Complete Guide (2026)

How to run large language models locally on your laptop, desktop, or phone — llama.cpp, Ollama, ONNX Runtime, and on-device options compared. No cloud API needed.

Glenn Sonna
· · 12 min read
local-llmrun-llm-locallyoffline-aiprivate-llmon-device-ailocal-ai

You do not need a cloud API key to use a large language model. Models like Llama 3.2, Qwen 3.5, Phi-3, and Gemma 2 run on consumer hardware — your laptop, your desktop, even your phone. No internet required, no per-token billing, no data leaving your machine.

This guide covers everything you need to start running LLMs locally: which models fit on which hardware, the major runtimes and tools available, how to choose between them, and what performance to expect.

Why Run LLMs Locally

Four reasons drive most people to local LLM inference:

Privacy. Your prompts and conversations stay on your machine. No data is transmitted to OpenAI, Anthropic, Google, or any other provider. For personal use, this means your journal entries, medical questions, and private thoughts never leave your device. For enterprise use, it means proprietary code, internal documents, and customer data stay within your infrastructure.

Cost. Cloud LLM APIs charge per token. At scale, this adds up quickly. Local inference has a one-time cost (downloading the model) and zero marginal cost per token. If you run hundreds of queries per day, local inference pays for itself within the first week.

Latency. Local inference eliminates network round-trips. For applications that need real-time responses — voice assistants, code completion, interactive chat — removing 100-300ms of network overhead makes a noticeable difference.

Control. You choose the model, the quantization level, the system prompt, and the inference parameters. No rate limits, no content filtering you did not ask for, no surprise model deprecations. The model runs when you tell it to, exactly how you configured it.

Hardware Requirements

The constraining factor for local LLMs is memory. A language model must fit entirely in RAM (or VRAM) during inference. Here is what you need:

Desktop and Laptop

Model SizeRAM Required (Q4 quantized)Example ModelsHardware
0.5-1B params1-2 GBQwen 3.5 0.8B, SmolLM2Any modern laptop
1-3B params2-3 GBLlama 3.2 1B, Llama 3.2 3B, Phi-3 MiniAny laptop with 8 GB RAM
7-8B params5-6 GBLlama 3.1 8B, Mistral 7B, Gemma 2 9BLaptop with 16 GB RAM
13B params8-10 GBLlama 2 13B, CodeLlama 13BLaptop with 16+ GB RAM
30-34B params20-24 GBCodeLlama 34B, Yi 34BDesktop with 32 GB RAM or GPU with 24 GB VRAM
70B params40-48 GBLlama 3.1 70BHigh-end workstation or multi-GPU setup

Apple Silicon (M1/M2/M3/M4) is particularly good for local LLMs because the unified memory architecture means CPU and GPU share the same RAM pool. A MacBook with 16 GB of unified memory can run 7-8B models fluently in the GPU.

NVIDIA GPUs with VRAM offload the entire model to the GPU for maximum speed. An RTX 4090 (24 GB VRAM) handles up to 13B models comfortably at full GPU speed, or 30B+ models with partial CPU offload.

CPU-only works for models up to 13B with acceptable speed (5-15 tokens/second). Beyond that, an integrated or discrete GPU is strongly recommended.

Mobile Devices

Device TierAvailable MemoryMax Model SizeExample Models
Flagship 2024+ (iPhone 15 Pro, Pixel 8 Pro)6-8 GB total (~3-4 GB for models)1-3B Q4Llama 3.2 1B, Qwen 3.5 0.8B
Mid-range 2024+4-6 GB total (~2-3 GB for models)0.5-1B Q4SmolLM2, Qwen 3.5 0.8B
Older / budget phonesUnder 4 GB totalNot recommended

Mobile inference is real but constrained. A 1B parameter model quantized to Q4 generates 10-20 tokens per second on a flagship phone — usable for chat, summarization, and simple question-answering. Models larger than 3B are impractical on current mobile hardware.

Model Formats

Local LLM models come in several formats. The format determines which runtimes can load the model and what optimizations are available.

GGUF

The dominant format for local LLM inference. GGUF files are self-contained — the model weights, tokenizer, and metadata are in a single file. llama.cpp, Ollama, LM Studio, and most local LLM tools use GGUF.

GGUF supports multiple quantization levels:

QuantizationBits Per WeightQualitySpeedUse Case
Q2_K~2.5LowFastestExperimentation only
Q4_K_M~4.5GoodFastRecommended default
Q5_K_M~5.5Very goodModerateWhen quality matters more than speed
Q6_K~6.5Near-FP16SlowerQuality-sensitive tasks
Q8_08ExcellentSlowWhen you have the memory
FP1616FullSlowestBenchmarking only

Start with Q4_K_M. It offers the best balance of quality, speed, and memory usage for most models and most use cases.

ONNX

Used by ONNX Runtime for cross-platform inference. ONNX models can leverage platform-specific accelerators (CoreML on Apple, DirectML on Windows, NNAPI on Android). Less common for LLMs but well-suited for smaller models and non-LLM tasks (ASR, TTS, embeddings, classification).

SafeTensors

The native format for Hugging Face models. Used by transformers, Candle (Rust), and other Python-ecosystem tools. Not directly loadable by llama.cpp or Ollama — you need to convert to GGUF first.

Runtime Options Compared

Several tools exist for running LLMs locally. They serve different use cases.

llama.cpp

The foundational project for local LLM inference. A C/C++ library with no dependencies that runs on CPU, CUDA, Metal, Vulkan, and SYCL.

Best for: Developers who need maximum control, custom integrations, embedding in other applications, mobile deployment.

Strengths:

  • Fastest CPU inference of any runtime
  • Metal acceleration on Apple Silicon
  • CUDA acceleration on NVIDIA GPUs
  • Used as the backend by Ollama, LM Studio, and most other tools
  • C API for embedding in any language

Tradeoffs:

  • No GUI — command-line and API only
  • Compiling from source for optimal performance
  • Configuration requires understanding quantization and inference parameters

Ollama

A user-friendly wrapper around llama.cpp. Install, ollama pull llama3.2, ollama run llama3.2. That is it.

Best for: Getting started quickly, running models as a local API server, replacing cloud API calls with a local endpoint.

Strengths:

  • Simplest installation and model management
  • OpenAI-compatible API server (drop-in replacement for many applications)
  • Model library with one-command downloads
  • Good defaults for most parameters

Tradeoffs:

  • Less control over inference parameters than raw llama.cpp
  • Abstracts away model format details (harder to use custom GGUF files)
  • Desktop only — no mobile support

LM Studio

A desktop application with a GUI for browsing, downloading, and chatting with local models.

Best for: Non-developers, experimentation, evaluating different models quickly.

Strengths:

  • Visual interface for model management
  • Built-in chat UI
  • Local API server (OpenAI-compatible)
  • Model discovery and download from HuggingFace

Tradeoffs:

  • Desktop only (macOS, Windows, Linux)
  • Not embeddable in other applications
  • Closed source

Embedding in Mobile and Cross-Platform Apps

For shipping local LLMs inside an application (mobile, desktop, or embedded), you need a runtime that can be linked as a library, not run as a separate process.

This is where llama.cpp’s C API becomes essential. It can be compiled for iOS, Android, macOS, Windows, and Linux, then called from your application code through FFI bindings. The alternative is ONNX Runtime, which provides similar cross-platform coverage but is more commonly used for non-LLM models.

The challenge is packaging: you need to bundle the model with your app or download it on first launch, manage model storage on constrained devices, and handle memory pressure gracefully. This is non-trivial engineering that most developers should not rebuild from scratch.

Choosing a Model

The “best” local LLM depends on your hardware and use case. Here are practical recommendations:

For Chat and General Assistance

HardwareRecommended ModelWhy
8 GB laptopLlama 3.2 3B Q4_K_MBest quality at this memory budget
16 GB laptopLlama 3.1 8B Q4_K_MBest general-purpose local model
32 GB desktopLlama 3.1 70B Q4_K_MCloud-competitive quality
Mobile (flagship)Llama 3.2 1B Q4_K_MBest mobile chat model

For Code

HardwareRecommended ModelWhy
16 GB laptopCodeLlama 7B Q4_K_M or DeepSeek Coder 6.7BTrained specifically for code generation
32 GB desktopCodeLlama 34B Q4_K_MClosest to cloud code assistants

For Summarization and Extraction

Smaller models (1-3B) handle summarization and information extraction well because the task is largely extractive rather than generative. A Llama 3.2 1B can summarize a document nearly as well as a 70B model for straightforward summarization tasks.

Performance Expectations

Realistic token generation speeds on common hardware:

ModelHardwareQuantizationTokens/secExperience
Llama 3.2 1BiPhone 15 ProQ4_K_M15-25Usable chat
Llama 3.2 3BM2 MacBook AirQ4_K_M30-50Fluent
Llama 3.1 8BM3 MacBook ProQ4_K_M25-40Fluent
Llama 3.1 8BRTX 4090Q4_K_M80-120Instant
Llama 3.1 70BM3 Max (48 GB)Q4_K_M8-15Readable
Llama 3.1 70BRTX 4090 (24 GB)Q4_K_M15-25 (partial offload)Usable

For conversational use, 10+ tokens/second feels responsive. Anything above 20 tokens/second feels fluent. Below 5 tokens/second, users notice the delay and it starts to feel sluggish.

Getting Started (5 Minutes)

The fastest path from zero to running a local LLM:

Option 1: Ollama (Simplest)

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama3.2

# Or run as an API server
ollama serve
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Hello"}'

Option 2: llama.cpp (Most Control)

# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j

# Download a model (GGUF format from HuggingFace)
# Then run:
./llama-cli -m path/to/model.gguf -p "Hello, how are you?" -n 256

Option 3: In a Mobile or Desktop App

For embedding a local LLM in your own application, you need an SDK that handles model management, platform-specific compilation, and runtime abstraction. This is more involved than running a CLI tool, but it gives you full control over the user experience.

The general pattern:

  1. Bundle or download the GGUF model on first launch
  2. Load the model into memory using llama.cpp (or an abstraction over it)
  3. Run inference through the library’s API
  4. Stream tokens to your UI as they are generated

Limitations to Know

Context length is memory-bound. A longer context window requires more memory. A 7B model with a 4K context fits in 6 GB. The same model with a 32K context needs 10+ GB. If memory is tight, reduce the context length.

First-token latency is real. The first token takes longer than subsequent tokens because the model must process the entire prompt (prefill). For a 2,000-token prompt on a 7B model, expect 1-3 seconds before the first token appears. Subsequent tokens stream at the speeds listed above.

Not all tasks benefit. Tasks that require vast world knowledge, multi-step reasoning chains, or real-time information retrieval still favor larger cloud models. Local LLMs excel at focused tasks: summarization, extraction, code generation, conversational Q&A, and text transformation.

Quantization has limits. Below Q3, most models degrade noticeably. Q4_K_M is the sweet spot. If a model does not perform well at Q4, try the next size up rather than increasing the quantization quality of the same model.

What Is Next for Local LLMs

The trajectory is clear: models are getting smaller and more capable. Each generation of hardware ships with more NPU/Neural Engine compute. The models that required a desktop last year run on a phone this year.

Within the next year, expect:

  • 3B parameter models that match today’s 7B quality
  • Dedicated NPU inference paths on iOS and Android for LLMs
  • Better speculative decoding for faster generation on constrained hardware
  • On-device fine-tuning for personalization

The gap between cloud and local model quality is closing. For the majority of practical AI features — chat, summarization, extraction, code assistance, voice interaction — local inference is already good enough. It will only get better.


Related

Related articles

· 3 min read

Run AI Models On-Device — Zero Config, Five Minutes

CLI, Rust, Flutter, Swift, Kotlin, Unity — run 25+ ML models on-device with one command. No tensor shapes, no preprocessing scripts.

on-device-airun-ml-locallyrust-ml
· 8 min read

Add Text-to-Speech to Your Flutter App in 15 Minutes

A step-by-step guide to adding high-quality, on-device TTS to a Flutter app using Xybrid and the Kokoro model. No cloud APIs, no API keys, no per-request costs.

flutterttstutorial
· 12 min read

On-Device AI: The Complete Guide to Running ML Models Locally

Everything you need to know about running machine learning models directly on mobile and desktop devices — privacy, latency, cost benefits, and how to get started.

on-device-aiedge-inferencemobile-ml