You do not need a cloud API key to use a large language model. Models like Llama 3.2, Qwen 3.5, Phi-3, and Gemma 2 run on consumer hardware — your laptop, your desktop, even your phone. No internet required, no per-token billing, no data leaving your machine.
This guide covers everything you need to start running LLMs locally: which models fit on which hardware, the major runtimes and tools available, how to choose between them, and what performance to expect.
Why Run LLMs Locally
Four reasons drive most people to local LLM inference:
Privacy. Your prompts and conversations stay on your machine. No data is transmitted to OpenAI, Anthropic, Google, or any other provider. For personal use, this means your journal entries, medical questions, and private thoughts never leave your device. For enterprise use, it means proprietary code, internal documents, and customer data stay within your infrastructure.
Cost. Cloud LLM APIs charge per token. At scale, this adds up quickly. Local inference has a one-time cost (downloading the model) and zero marginal cost per token. If you run hundreds of queries per day, local inference pays for itself within the first week.
Latency. Local inference eliminates network round-trips. For applications that need real-time responses — voice assistants, code completion, interactive chat — removing 100-300ms of network overhead makes a noticeable difference.
Control. You choose the model, the quantization level, the system prompt, and the inference parameters. No rate limits, no content filtering you did not ask for, no surprise model deprecations. The model runs when you tell it to, exactly how you configured it.
Hardware Requirements
The constraining factor for local LLMs is memory. A language model must fit entirely in RAM (or VRAM) during inference. Here is what you need:
Desktop and Laptop
| Model Size | RAM Required (Q4 quantized) | Example Models | Hardware |
|---|---|---|---|
| 0.5-1B params | 1-2 GB | Qwen 3.5 0.8B, SmolLM2 | Any modern laptop |
| 1-3B params | 2-3 GB | Llama 3.2 1B, Llama 3.2 3B, Phi-3 Mini | Any laptop with 8 GB RAM |
| 7-8B params | 5-6 GB | Llama 3.1 8B, Mistral 7B, Gemma 2 9B | Laptop with 16 GB RAM |
| 13B params | 8-10 GB | Llama 2 13B, CodeLlama 13B | Laptop with 16+ GB RAM |
| 30-34B params | 20-24 GB | CodeLlama 34B, Yi 34B | Desktop with 32 GB RAM or GPU with 24 GB VRAM |
| 70B params | 40-48 GB | Llama 3.1 70B | High-end workstation or multi-GPU setup |
Apple Silicon (M1/M2/M3/M4) is particularly good for local LLMs because the unified memory architecture means CPU and GPU share the same RAM pool. A MacBook with 16 GB of unified memory can run 7-8B models fluently in the GPU.
NVIDIA GPUs with VRAM offload the entire model to the GPU for maximum speed. An RTX 4090 (24 GB VRAM) handles up to 13B models comfortably at full GPU speed, or 30B+ models with partial CPU offload.
CPU-only works for models up to 13B with acceptable speed (5-15 tokens/second). Beyond that, an integrated or discrete GPU is strongly recommended.
Mobile Devices
| Device Tier | Available Memory | Max Model Size | Example Models |
|---|---|---|---|
| Flagship 2024+ (iPhone 15 Pro, Pixel 8 Pro) | 6-8 GB total (~3-4 GB for models) | 1-3B Q4 | Llama 3.2 1B, Qwen 3.5 0.8B |
| Mid-range 2024+ | 4-6 GB total (~2-3 GB for models) | 0.5-1B Q4 | SmolLM2, Qwen 3.5 0.8B |
| Older / budget phones | Under 4 GB total | Not recommended | — |
Mobile inference is real but constrained. A 1B parameter model quantized to Q4 generates 10-20 tokens per second on a flagship phone — usable for chat, summarization, and simple question-answering. Models larger than 3B are impractical on current mobile hardware.
Model Formats
Local LLM models come in several formats. The format determines which runtimes can load the model and what optimizations are available.
GGUF
The dominant format for local LLM inference. GGUF files are self-contained — the model weights, tokenizer, and metadata are in a single file. llama.cpp, Ollama, LM Studio, and most local LLM tools use GGUF.
GGUF supports multiple quantization levels:
| Quantization | Bits Per Weight | Quality | Speed | Use Case |
|---|---|---|---|---|
| Q2_K | ~2.5 | Low | Fastest | Experimentation only |
| Q4_K_M | ~4.5 | Good | Fast | Recommended default |
| Q5_K_M | ~5.5 | Very good | Moderate | When quality matters more than speed |
| Q6_K | ~6.5 | Near-FP16 | Slower | Quality-sensitive tasks |
| Q8_0 | 8 | Excellent | Slow | When you have the memory |
| FP16 | 16 | Full | Slowest | Benchmarking only |
Start with Q4_K_M. It offers the best balance of quality, speed, and memory usage for most models and most use cases.
ONNX
Used by ONNX Runtime for cross-platform inference. ONNX models can leverage platform-specific accelerators (CoreML on Apple, DirectML on Windows, NNAPI on Android). Less common for LLMs but well-suited for smaller models and non-LLM tasks (ASR, TTS, embeddings, classification).
SafeTensors
The native format for Hugging Face models. Used by transformers, Candle (Rust), and other Python-ecosystem tools. Not directly loadable by llama.cpp or Ollama — you need to convert to GGUF first.
Runtime Options Compared
Several tools exist for running LLMs locally. They serve different use cases.
llama.cpp
The foundational project for local LLM inference. A C/C++ library with no dependencies that runs on CPU, CUDA, Metal, Vulkan, and SYCL.
Best for: Developers who need maximum control, custom integrations, embedding in other applications, mobile deployment.
Strengths:
- Fastest CPU inference of any runtime
- Metal acceleration on Apple Silicon
- CUDA acceleration on NVIDIA GPUs
- Used as the backend by Ollama, LM Studio, and most other tools
- C API for embedding in any language
Tradeoffs:
- No GUI — command-line and API only
- Compiling from source for optimal performance
- Configuration requires understanding quantization and inference parameters
Ollama
A user-friendly wrapper around llama.cpp. Install, ollama pull llama3.2, ollama run llama3.2. That is it.
Best for: Getting started quickly, running models as a local API server, replacing cloud API calls with a local endpoint.
Strengths:
- Simplest installation and model management
- OpenAI-compatible API server (drop-in replacement for many applications)
- Model library with one-command downloads
- Good defaults for most parameters
Tradeoffs:
- Less control over inference parameters than raw llama.cpp
- Abstracts away model format details (harder to use custom GGUF files)
- Desktop only — no mobile support
LM Studio
A desktop application with a GUI for browsing, downloading, and chatting with local models.
Best for: Non-developers, experimentation, evaluating different models quickly.
Strengths:
- Visual interface for model management
- Built-in chat UI
- Local API server (OpenAI-compatible)
- Model discovery and download from HuggingFace
Tradeoffs:
- Desktop only (macOS, Windows, Linux)
- Not embeddable in other applications
- Closed source
Embedding in Mobile and Cross-Platform Apps
For shipping local LLMs inside an application (mobile, desktop, or embedded), you need a runtime that can be linked as a library, not run as a separate process.
This is where llama.cpp’s C API becomes essential. It can be compiled for iOS, Android, macOS, Windows, and Linux, then called from your application code through FFI bindings. The alternative is ONNX Runtime, which provides similar cross-platform coverage but is more commonly used for non-LLM models.
The challenge is packaging: you need to bundle the model with your app or download it on first launch, manage model storage on constrained devices, and handle memory pressure gracefully. This is non-trivial engineering that most developers should not rebuild from scratch.
Choosing a Model
The “best” local LLM depends on your hardware and use case. Here are practical recommendations:
For Chat and General Assistance
| Hardware | Recommended Model | Why |
|---|---|---|
| 8 GB laptop | Llama 3.2 3B Q4_K_M | Best quality at this memory budget |
| 16 GB laptop | Llama 3.1 8B Q4_K_M | Best general-purpose local model |
| 32 GB desktop | Llama 3.1 70B Q4_K_M | Cloud-competitive quality |
| Mobile (flagship) | Llama 3.2 1B Q4_K_M | Best mobile chat model |
For Code
| Hardware | Recommended Model | Why |
|---|---|---|
| 16 GB laptop | CodeLlama 7B Q4_K_M or DeepSeek Coder 6.7B | Trained specifically for code generation |
| 32 GB desktop | CodeLlama 34B Q4_K_M | Closest to cloud code assistants |
For Summarization and Extraction
Smaller models (1-3B) handle summarization and information extraction well because the task is largely extractive rather than generative. A Llama 3.2 1B can summarize a document nearly as well as a 70B model for straightforward summarization tasks.
Performance Expectations
Realistic token generation speeds on common hardware:
| Model | Hardware | Quantization | Tokens/sec | Experience |
|---|---|---|---|---|
| Llama 3.2 1B | iPhone 15 Pro | Q4_K_M | 15-25 | Usable chat |
| Llama 3.2 3B | M2 MacBook Air | Q4_K_M | 30-50 | Fluent |
| Llama 3.1 8B | M3 MacBook Pro | Q4_K_M | 25-40 | Fluent |
| Llama 3.1 8B | RTX 4090 | Q4_K_M | 80-120 | Instant |
| Llama 3.1 70B | M3 Max (48 GB) | Q4_K_M | 8-15 | Readable |
| Llama 3.1 70B | RTX 4090 (24 GB) | Q4_K_M | 15-25 (partial offload) | Usable |
For conversational use, 10+ tokens/second feels responsive. Anything above 20 tokens/second feels fluent. Below 5 tokens/second, users notice the delay and it starts to feel sluggish.
Getting Started (5 Minutes)
The fastest path from zero to running a local LLM:
Option 1: Ollama (Simplest)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama run llama3.2
# Or run as an API server
ollama serve
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Hello"}' Option 2: llama.cpp (Most Control)
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j
# Download a model (GGUF format from HuggingFace)
# Then run:
./llama-cli -m path/to/model.gguf -p "Hello, how are you?" -n 256 Option 3: In a Mobile or Desktop App
For embedding a local LLM in your own application, you need an SDK that handles model management, platform-specific compilation, and runtime abstraction. This is more involved than running a CLI tool, but it gives you full control over the user experience.
The general pattern:
- Bundle or download the GGUF model on first launch
- Load the model into memory using llama.cpp (or an abstraction over it)
- Run inference through the library’s API
- Stream tokens to your UI as they are generated
Limitations to Know
Context length is memory-bound. A longer context window requires more memory. A 7B model with a 4K context fits in 6 GB. The same model with a 32K context needs 10+ GB. If memory is tight, reduce the context length.
First-token latency is real. The first token takes longer than subsequent tokens because the model must process the entire prompt (prefill). For a 2,000-token prompt on a 7B model, expect 1-3 seconds before the first token appears. Subsequent tokens stream at the speeds listed above.
Not all tasks benefit. Tasks that require vast world knowledge, multi-step reasoning chains, or real-time information retrieval still favor larger cloud models. Local LLMs excel at focused tasks: summarization, extraction, code generation, conversational Q&A, and text transformation.
Quantization has limits. Below Q3, most models degrade noticeably. Q4_K_M is the sweet spot. If a model does not perform well at Q4, try the next size up rather than increasing the quantization quality of the same model.
What Is Next for Local LLMs
The trajectory is clear: models are getting smaller and more capable. Each generation of hardware ships with more NPU/Neural Engine compute. The models that required a desktop last year run on a phone this year.
Within the next year, expect:
- 3B parameter models that match today’s 7B quality
- Dedicated NPU inference paths on iOS and Android for LLMs
- Better speculative decoding for faster generation on constrained hardware
- On-device fine-tuning for personalization
The gap between cloud and local model quality is closing. For the majority of practical AI features — chat, summarization, extraction, code assistance, voice interaction — local inference is already good enough. It will only get better.
Related
- On-Device AI: The Complete Guide to Running ML Models Locally — broader guide covering all model types, not just LLMs.
- Edge AI vs Cloud AI: When to Run Models On-Device — decision framework for when local inference makes sense vs cloud.