LLM Primer
Everything you need to know to run AI models on your own hardware.
What is a large language model?
A large language model (LLM) is a neural network trained to predict and generate text. It learns patterns from enormous amounts of data, which lets it answer questions, write code, summarise documents, and hold conversations. Unlike cloud AI services, open-weight models like Llama, Mistral, and Qwen can be downloaded and run entirely on your own hardware — no internet connection required during inference, no data leaving your machine.
Why VRAM is the bottleneck
A model's weights (the billions of learned numerical values that define its behaviour) must be loaded into GPU memory (VRAM) before the model can run at full speed. Weights are not streamed from disk during generation, so the whole model has to fit in memory. A 7B parameter model at 16-bit precision occupies roughly 14 GB of VRAM (7 billion weights × 2 bytes each), which is more than most consumer GPUs have. This is where quantization comes in.
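The arithmetic behind that 14 GB figure can be sketched in a few lines. This is a rough estimate of the weights alone, assuming decimal gigabytes; the function name is illustrative and not part of any library.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes (assumption)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# The example from the text: a 7B model at 16-bit precision.
print(weights_size_gb(7, 16))
```

Lowering `bits_per_weight` shrinks the result proportionally, which is the whole idea behind quantization.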
Quantization: trading precision for size
Quantization compresses model weights by storing them at lower precision. Instead of a 16-bit float per weight, you might use 4 or 8 bits. The most widely used formats are:
- Q4_K_M — approximately 4 bits per weight. A 7B model shrinks to around 4.4 GB, fitting comfortably on an 8 GB GPU. Output quality is slightly reduced but usually indistinguishable from 16-bit for everyday tasks.
- Q8_0 — approximately 8 bits per weight. The same 7B model is about 7.7 GB. Quality is very close to full precision, at the cost of needing more VRAM.
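Models on this site are stored in the GGUF format, and the VRAM figures shown are the actual file sizes: what you'll download and load into memory. The context you use needs additional VRAM on top of that figure (see "Context windows and the KV cache" below).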
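Working backwards from the file sizes quoted above shows that the effective bits per weight come out a little higher than the nominal 4 or 8. One likely reason is that quantized formats store scaling metadata alongside the weights, and "7B" is itself a rounded parameter count. The helper below is a hypothetical sketch, not part of any GGUF tooling.

```python
def effective_bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Average bits stored per parameter, derived from the file size."""
    return file_size_gb * 8 / params_billions


# File sizes taken from the list above, for a nominal 7B model.
for label, size_gb in [("Q4_K_M", 4.4), ("Q8_0", 7.7)]:
    bits = effective_bits_per_weight(size_gb, 7)
    print(f"{label}: about {bits:.1f} bits per weight")
```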
Memory bandwidth determines speed
Once a model fits in VRAM, how fast it generates tokens is almost entirely determined by your GPU's memory bandwidth, not its compute throughput. During each generation step, the GPU reads the entire set of model weights once, so bandwidth divided by model size gives an upper bound on speed. A GPU with 1000 GB/s of bandwidth reading a 5 GB model can generate at most roughly 200 tokens per second (1000 ÷ 5); real-world speeds are somewhat lower. Compute cores sit idle most of the time during inference; the bottleneck is how quickly data can move between memory and the processor.
This is why an RTX 4090 (1008 GB/s) runs models significantly faster than a GPU with the same VRAM but half the bandwidth. It's also why Apple Silicon performs so well for local inference: unified memory means both CPU and GPU share the same high-bandwidth pool.
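The bandwidth rule of thumb reduces to a single division. The sketch below treats it as a theoretical ceiling, assuming one full read of the weights per generated token and ignoring every other overhead; the function name is illustrative.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed: one full pass over the weights per token."""
    return bandwidth_gb_s / model_size_gb


print(max_tokens_per_second(1000, 5))  # the 1000 GB/s, 5 GB example from the text
print(max_tokens_per_second(1008, 5))  # RTX 4090 bandwidth
print(max_tokens_per_second(504, 5))   # a card with half that bandwidth
```

Halving the bandwidth halves the ceiling, regardless of how much compute the card has.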
Context windows and the KV cache
The context window is the maximum amount of text a model can "see" at once, covering both your input and its own previous output. A 32K context window lets you feed in long documents or maintain extended conversations. However, storing the context requires additional VRAM in the form of a KV cache. The larger the context, the more VRAM it consumes, which is why a model might technically fit in your VRAM but leave room for only a 4K context when you need 32K.
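To get a feel for the numbers, the KV cache of a standard transformer can be estimated as two tensors (keys and values) per layer, per token. The layer count, head count, and head size below are assumptions describing a hypothetical 7B-class model without grouped-query attention, stored at 16-bit precision; real models and runtimes vary.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # Keys and values (hence the factor of 2) are stored for every layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 2**30


# Hypothetical model shape (assumption): 32 layers, 32 KV heads, head size 128.
for context in (4096, 32768):
    size = kv_cache_gib(context, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"{context} tokens: {size:.1f} GiB")
```

Under these assumptions the cache grows linearly with context length. Models that use grouped-query attention have fewer KV heads and therefore a proportionally smaller cache.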
System RAM offloading
If a model is too large for your VRAM alone, runtimes like llama.cpp, Ollama, and LM Studio can offload some layers to system RAM, processing them on the CPU. This makes larger models technically runnable on modest hardware. The trade-off is speed: CPU memory bandwidth is typically 50–100 GB/s, versus 500–1000+ GB/s on a modern GPU. A model running with heavy offload can drop from 60+ tokens/sec to single digits.
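The slowdown can be approximated by adding up the time each device needs to read its share of the weights. This is a simplified sketch, assuming the split is proportional to model size and that bandwidth is the only cost; the function and its parameters are illustrative, not taken from llama.cpp, Ollama, or LM Studio.

```python
def tokens_per_second_with_offload(model_size_gb: float, gpu_fraction: float,
                                   gpu_bw_gb_s: float, cpu_bw_gb_s: float) -> float:
    """Estimate speed when only part of the model sits in VRAM."""
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    # Time per token is the sum of the time spent reading each portion.
    gpu_time = model_size_gb * gpu_fraction / gpu_bw_gb_s
    cpu_time = model_size_gb * (1.0 - gpu_fraction) / cpu_bw_gb_s
    return 1.0 / (gpu_time + cpu_time)


# A 5 GB model on a 1000 GB/s GPU with 50 GB/s system RAM (illustrative figures).
for fraction in (1.0, 0.5, 0.0):
    speed = tokens_per_second_with_offload(5, fraction, 1000, 50)
    print(f"{fraction:.0%} on GPU: about {speed:.0f} tokens/sec")
```

Because the slow portion dominates the sum, offloading even half of the layers removes most of the GPU's speed advantage.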
Multi-GPU setups
Multiple GPUs pool their VRAM, letting you run models that no single card could hold. A pair of RTX 3090s gives you 48 GB of combined VRAM, enough for a 70B model at Q4_K_M. The bandwidth doesn't double cleanly though: PCIe interconnects between cards add overhead, so real-world throughput is typically 70–85% of the theoretical combined bandwidth. NVLink (available on the RTX 3090 and some workstation cards) performs closer to the upper end of that range.
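As a rough illustration, the pooled VRAM and the 70–85% efficiency range can be combined as follows. The per-card figures (24 GB and 936 GB/s for an RTX 3090) are assumptions supplied for the example, and the helper names are hypothetical.

```python
def pooled_vram_gb(cards_vram_gb: list[float]) -> float:
    """Total VRAM available across all cards."""
    return sum(cards_vram_gb)


def effective_bandwidth_range(cards_bw_gb_s: list[float],
                              low: float = 0.70,
                              high: float = 0.85) -> tuple[float, float]:
    """Apply the 70-85% rule of thumb to the summed bandwidth."""
    total = sum(cards_bw_gb_s)
    return total * low, total * high


# Two RTX 3090s; the per-card specifications are assumptions for this example.
print(pooled_vram_gb([24, 24]))
low, high = effective_bandwidth_range([936, 936])
print(f"about {low:.0f} to {high:.0f} GB/s effective")
```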
Choosing the right model
The right model depends on what you're doing. A few rules of thumb:
- For general chat and writing, Q4_K_M quality is fine. Use the largest model that fits.
- For coding tasks, prioritise models with a high SWE-bench score: sort by "Coding" in the tool.
- For knowledge-heavy tasks (research, summarisation), MMLU score is a better proxy — sort by "General".
- If generation speed matters, aim for at least 15–20 tokens/sec. Below 10 tok/s feels noticeably slow.
- If you need long documents or extended context, filter by minimum context window first.
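These rules of thumb amount to filtering on hard constraints first and sorting by the relevant score afterwards. The sketch below shows one way that could look; the `ModelEntry` fields, the `shortlist` helper, and every catalogue entry are invented for illustration and do not describe real models or this site's actual data.

```python
from dataclasses import dataclass


@dataclass
class ModelEntry:
    name: str
    size_gb: float
    max_context: int
    est_tokens_per_sec: float
    coding_score: float
    general_score: float


def shortlist(models: list[ModelEntry], vram_gb: float, min_context: int,
              min_tokens_per_sec: float = 15.0,
              sort_by: str = "general_score") -> list[ModelEntry]:
    """Drop models that fail a hard constraint, then rank the rest by score."""
    candidates = [
        m for m in models
        if m.size_gb <= vram_gb
        and m.max_context >= min_context
        and m.est_tokens_per_sec >= min_tokens_per_sec
    ]
    return sorted(candidates, key=lambda m: getattr(m, sort_by), reverse=True)


# Placeholder entries only: names, sizes, speeds, and scores are made up.
catalogue = [
    ModelEntry("example-small", 4.4, 32768, 90.0, 20.0, 60.0),
    ModelEntry("example-medium", 7.7, 8192, 55.0, 30.0, 65.0),
    ModelEntry("example-large", 19.0, 32768, 12.0, 45.0, 75.0),
]

for model in shortlist(catalogue, vram_gb=24, min_context=8192, sort_by="coding_score"):
    print(model.name)
```

This sketch compares file size against VRAM directly and leaves no headroom for the KV cache, so in practice the size limit should be set somewhat below the card's total memory.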