Why Ollama and llama.cpp crawl when models spill into RAM, and how to fix it
A practical guide to faster local AI: fit models in VRAM, tame context length, cut parallelism, and avoid silent CPU fallback.

Local inference sounds simple on paper. Download a model, point Ollama or llama.cpp at your GPU, and start chatting. Then the trap shows up. The model loads, but replies dribble out one token at a time, the first token takes forever, and a setup that looked fine on the spec sheet suddenly feels unusable in real life. That gap between “it runs” and “it works” is the whole story here.
Recent user complaints capture the pattern. In a LocalLLaMA thread from a CPU-only user, the question is basically whether running in RAM has any downside beyond awful speed. That framing is revealing. People are not asking whether local inference is possible. They are asking why local AI performance falls off a cliff once VRAM runs out.
That question matters because local inference is supposed to buy control. Running models on your own machine means fewer rate limits, less dependence on hosted APIs, and more freedom to choose your own tools. When a local stack becomes painful the minute a model spills out of VRAM, people drift back to rented inference. Some of that is unavoidable hardware physics. Some of it comes from defaults that hide what is really happening. Either way, the user experience is the same. Local AI starts to feel like a science experiment instead of a dependable tool.
The good news is that this problem is not mysterious. Once you separate prompt processing from token generation, and once you understand what changes when weights or the KV cache leave fast memory, the weird behavior starts to make sense. The fixes also get a lot less magical.
What spilling into RAM actually does to local inference
When local inference is fast, the happy path is simple. Model weights, the KV cache, and the working set stay on the GPU, inside fast local memory. That is the regime modern desktop GPUs are built for. In NVIDIA’s Ada architecture paper, the RTX 4090 is described as offering 1 TB/sec of peak memory bandwidth, which tells you why GPU-resident workloads can feel snappy.
Once part of the model or KV cache spills into system RAM, you are no longer living on that fast path. Ollama’s FAQ is blunt about a related point. If a model fits entirely on one GPU, Ollama prefers that because it usually delivers the best performance and reduces data transfer across the PCI bus. That one detail explains a lot of real-world pain. The slowdown is not just the same thing happening a bit more slowly. The machine has shifted into a worse performance regime.
In practice, that means every token can start paying extra costs for host memory access, data movement, and coordination between CPU and GPU. That is why mixed VRAM and RAM setups often feel dramatically worse than the spec sheet suggests. Capacity might be technically sufficient, but usable bandwidth and latency are telling a different story.
This is also why the cliff feels sharper than people expect. Going from a full-VRAM run to a partly CPU-backed run is not a gentle slide. It changes where the bottleneck lives. A model that felt responsive can turn sticky almost instantly once context growth, parallel requests, or an oversized quant pushes it beyond a clean GPU fit.
Prompt processing and token generation do not fail the same way
One reason local inference feels confusing is that there are really two workloads hiding under the same chat window. A recent llama.cpp discussion on optimization lays it out cleanly. Prompt processing feeds the whole prompt through the model, while token generation emits one token at a time. Those phases stress different parts of the machine.
Prompt processing is usually compute-bound. Token generation is usually memory-bound. That distinction explains why one rig can have decent tokens per second once generation starts, yet still feel awful because time to first token is terrible. It also explains why some users with plenty of GPU compute still complain that long prompts feel sticky and sluggish.
You can see the older version of the same complaint in an earlier llama.cpp discussion about long prompts. Users were trying to understand why starting with a long prompt felt so much slower than they expected. That is not a historical footnote. It is still one of the main reasons people think a model is broken when the real issue is that prompt ingestion and decoding are bottlenecked differently.
This distinction matters for diagnosis. If your local AI box stalls before the first token, the problem may be prompt processing, context size, or a long repeated system prompt. If it starts replying quickly but then crawls, the issue is more likely generation bandwidth, KV cache pressure, or spill from VRAM into RAM. Treating both symptoms as one problem leads to bad fixes and lots of wasted time.
Why mixed CPU and GPU inference feels so bad
A lot of users assume a mixed CPU and GPU setup should be close enough. After all, some part of the workload is still on the GPU. In practice, mixed inference often disappoints because it loses the main benefit of clean GPU residency.
Ollama’s FAQ on model loading across GPUs makes the priority clear. If a model fits on a single GPU, that is usually the fastest path because it avoids extra PCI bus traffic. Only when a model cannot fit on one GPU does Ollama spread it across multiple GPUs. That tells you what the software itself is optimizing for. It is not just raw memory capacity. It is reduced movement.
The same logic applies when a model spills into system RAM. Once weights or cache state start bouncing between GPU memory and host memory, performance takes a hit that users often experience as sudden and irrational. It is not irrational. The system is paying for a slower route every time it needs data that is not sitting where the GPU wants it.
This is also why “two GPUs will fix it” is only sometimes true. More hardware can solve capacity problems. It does not guarantee speed. If the box has to shuffle work across devices or fall back to system memory under pressure, the improvement can be much smaller than expected.
Context length and parallelism quietly eat your memory
For a lot of local AI users, model size gets blamed for everything. Model size matters, but it is only part of the story. The quieter problem is that context length and parallelism can burn through memory fast enough to turn a previously smooth setup into a miserable one.
Ollama’s context-length documentation says the default context window scales with available VRAM, from 4k on smaller VRAM setups to 32k and 256k on larger ones. The same page warns that increasing context length raises memory requirements and specifically tells users to verify model offloading with ollama ps. That warning deserves more attention than it gets. A model can feel fast at one context size and terrible at another, even when everything else stays the same.
The Ollama FAQ adds the other half of the trap. Required RAM scales with OLLAMA_NUM_PARALLEL * OLLAMA_CONTEXT_LENGTH. That means a personal workstation can accidentally behave like an overloaded shared server if parallelism is left too high. Users often think the model suddenly became slow. In reality, the machine is reserving far more KV cache than they realized.
This is why borderline fits are so fragile. A setup that feels fine during light testing can collapse once you raise OLLAMA_CONTEXT_LENGTH, leave multiple chats open, or run tools that push larger prompts through the model. Nothing mysterious happened. The working set grew past the point where fast memory could hold it comfortably.
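To get a feel for how fast that working set grows, here is a back-of-the-envelope sketch of KV cache size. The layer count, KV head count, and head dimension below are assumptions modeled on a Llama-3-8B-style architecture, not measurements from any particular build:

```shell
#!/bin/sh
# Rough KV cache sizing. Assumed shapes: 32 layers, 8 KV heads,
# head_dim 128 (Llama-3-8B-style), f16 cache at 2 bytes per element.
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; BYTES=2
CTX=32768      # OLLAMA_CONTEXT_LENGTH
PARALLEL=4     # OLLAMA_NUM_PARALLEL
# K and V each store LAYERS * KV_HEADS * HEAD_DIM values per token,
# hence the leading factor of 2.
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
TOTAL_MIB=$((PER_TOKEN * CTX * PARALLEL / 1024 / 1024))
echo "KV cache per token: $PER_TOKEN bytes"   # → 131072 bytes (128 KiB)
echo "Total KV cache:     $TOTAL_MIB MiB"     # → 16384 MiB (16 GiB)
```

Note the multiplication: dropping parallelism from 4 to 1, or context from 32k to 8k, each cuts that 16 GiB reservation by a factor of four. That is why a setting you never touched can quietly decide whether the model fits.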
Search engines should understand this page for what it is, so it is worth stating plainly. Local inference slows down when the model, the KV cache, or both no longer fit cleanly in VRAM. That is the heart of the VRAM-to-RAM spill problem, whether you are using Ollama, llama.cpp, or another local LLM stack built on similar constraints.
CPU-only inference can work, but it changes the kind of work that feels good
CPU-only inference is not fake inference. Ollama supports CPU execution, and llama.cpp was built in large part to make local models accessible on ordinary hardware. But “possible” and “pleasant” are very different standards.
That difference shows up clearly in the LocalLLaMA thread mentioned earlier. Users basically say the same thing many people discover the hard way. Yes, you can run bigger models in RAM. No, the experience is usually not great for interactive work. The real pain often shows up in prompt processing, not just raw decode speed.
That leads to a more useful rule of thumb. CPU-only local inference can be perfectly fine for background summarization, overnight extraction, batch processing, or any workflow where you submit a job and walk away. It is much less appealing for coding assistants, iterative editing, agent loops, or chat sessions where you keep revising prompts and waiting for fresh prefill every few minutes.
In other words, CPU-only inference is not broken. It is just a better fit for patient workflows. Once you accept that, hardware decisions get easier and so do expectations.

The boring fixes that usually work best
The most effective fixes for slow local inference are usually the least glamorous ones. They start with fit. If you can choose between a larger model file that only runs by spilling into RAM and a slightly smaller quantized file that stays entirely in VRAM, the smaller file often wins in real work. The llama.cpp quantization docs make the tradeoff explicit. Quantization reduces model size and can speed inference, though it may cost some accuracy.
That trade is often worth it. A model you can fully keep on the GPU is usually more useful than a smarter one that technically loads but turns every interaction into a wait.
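If you build llama.cpp yourself, producing a smaller quant is a one-line job with its quantize tool. The filenames here are illustrative, and Q4_K_M is just one common middle ground between size and quality:

```shell
# Convert an f16 GGUF into a Q4_K_M quant that is far more likely to
# fit entirely in VRAM. Paths are placeholders for your own files.
./llama-quantize ./model-f16.gguf ./model-q4_k_m.gguf Q4_K_M
```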
Next, cut context before you blame the model. Ollama’s context-length guide spells out that larger context needs more memory, and the FAQ explains how easily memory use compounds when OLLAMA_NUM_PARALLEL is greater than 1. On a personal machine, setting OLLAMA_NUM_PARALLEL=1 is often the simplest speed fix you can make.
Then look at KV cache memory pressure. Ollama’s FAQ section on Flash Attention and KV cache quantization says OLLAMA_FLASH_ATTENTION=1 can significantly reduce memory use as context grows. It also explains that OLLAMA_KV_CACHE_TYPE=q8_0 uses about half the memory of f16, while q4_0 uses about a quarter with a larger quality tradeoff. Those settings are not magic, but they are very real levers when a workload is right on the edge of fitting.
It is also worth checking that the GPU path is actually alive. Ollama’s troubleshooting guide warns that outdated AMD ROCm drivers on Linux can stall GPU discovery and push the server into CPU fallback. That is the kind of issue that makes people think a model or quant suddenly got worse, when the real problem is that the hardware path changed underneath them.
The fastest diagnostic in Ollama is still ollama ps. According to the FAQ and the context-length page, the PROCESSOR column tells you whether the model is on 100% GPU, 100% CPU, or a mixed split. If you expected a clean GPU fit and see a CPU/GPU split instead, you have already found the story.
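Running it takes seconds. The output below is illustrative, not from a real session, but the shape is what you should expect:

```shell
ollama ps
# Illustrative output; a CPU/GPU split like this means the model
# did not fit cleanly in VRAM:
# NAME          ID            SIZE     PROCESSOR          UNTIL
# llama3.1:8b   46e0c10c039e  6.2 GB   28%/72% CPU/GPU    4 minutes from now
```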
For llama.cpp, the same principle applies. Keep an eye on model fit, offload, and cache settings. The project’s completion README documents flags such as -fa for Flash Attention, -ctk and -ctv for KV cache types, -np for parallel decoding, and --mlock to keep the model in RAM instead of letting the operating system swap or compress it. Small configuration choices can be the difference between a usable local setup and a painful one.
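As a sketch, those flags combine into a launch line like the following. The model path and every value here are assumptions to adapt, not recommendations:

```shell
# Illustrative llama-server launch: all layers offloaded to the GPU,
# modest context, one sequence, Flash Attention, q8_0 KV cache,
# and weights pinned in RAM so the OS cannot swap them out.
./llama-server -m ./model-q4_k_m.gguf \
  -ngl 99 -c 8192 -np 1 -fa -ctk q8_0 -ctv q8_0 --mlock
```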
How to diagnose the bottleneck in five minutes
The simplest way to troubleshoot local inference is to stop guessing and separate prompt speed from generation speed.
If you are using Ollama, start with ollama ps. That tells you where the model actually loaded and how much context was allocated. Then lower OLLAMA_CONTEXT_LENGTH and set OLLAMA_NUM_PARALLEL=1 before you change anything more exotic. Those two settings eliminate a surprising amount of accidental memory pressure.
If you are using llama.cpp, the server README gives you a better measurement path. The /metrics endpoint exposes llamacpp:prompt_tokens_seconds and llamacpp:predicted_tokens_seconds, which lets you see whether the pain is in prompt processing, token generation, or both. That matters because the fix for slow prefill is not always the fix for slow decode.
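Assuming the server was started with metrics enabled (the --metrics flag) on the default port, pulling those two counters apart is one command:

```shell
# Separate prefill speed from decode speed; slow prefill and slow
# decode call for different fixes.
curl -s http://localhost:8080/metrics \
  | grep -E 'llamacpp:(prompt|predicted)_tokens_seconds'
```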
The same server documentation also explains cache_prompt=true, which reuses KV cache when a new prompt shares a common prefix with the previous one. That is one of the most practical speedups for iterative work. If you keep reusing a long system prompt, a template, or a repeated instruction block, cache reuse can save you from paying the full prompt-processing cost every time.
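A minimal request against the server's completion endpoint might look like this; the prompt text is illustrative, and the point is the cache_prompt field:

```shell
# Repeated requests sharing this prefix skip re-processing it,
# because cache_prompt reuses the matching KV cache prefix.
curl -s http://localhost:8080/completion -d '{
  "prompt": "You are a careful editor. Rewrite the following text: ...",
  "n_predict": 128,
  "cache_prompt": true
}'
```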
For command-line workflows, the completion tool docs are worth a look because they document performance-related flags directly. Settings like --perf, -np, -fa, -ctk, -ctv, and --mlock are not obscure tweaks for hobbyists. They are the knobs that tell llama.cpp how aggressively to use memory, how many parallel sequences to decode, and how much pressure to put on the system.
The practical order is simple. First, verify fit. Second, shrink context. Third, reduce parallelism. Fourth, enable memory-saving options like Flash Attention and KV cache quantization if your quality tolerance allows it. Fifth, confirm that your GPU did not silently drop out of the picture. Most local inference slowdowns become much easier to explain once you follow that sequence.
Why observability matters for local AI
There is no conspiracy in the fact that VRAM is faster than system RAM. That part is just physics. But there is a product problem when local AI tools hide mixed CPU/GPU splits, scale memory usage through defaults most users never notice, or fall back to CPU with weak warnings.
That matters because bad visibility makes local inference look worse than it really is. If a user thinks a model is just slow, they never learn whether the culprit was context length, KV cache growth, cross-device transfer, or a dead GPU path. Better observability makes self-hosting less frustrating and a lot more practical.
In that sense, local AI performance is not only a hardware story. It is also a tooling story. Clear reporting around VRAM fit, offload behavior, context allocation, and fallback state does almost as much for user confidence as raw throughput does.
Local inference gets better once you match the tool to the job
The core reason local inference crawls when models spill into RAM is straightforward. Fast local AI depends on keeping the working set in fast memory. When weights or KV cache stop fitting in VRAM, the machine starts leaning on slower paths, and the experience changes fast.
From there, most of the confusing symptoms line up. Long prompts punish prompt processing. Oversized context windows and unnecessary parallelism eat memory quietly. Mixed CPU and GPU execution brings bus traffic and coordination overhead into the picture. CPU-only inference remains valid, but it is much better suited to patient workflows than to rapid interactive loops.
That is why the most reliable fixes are so plain. Pick the model or quant that fully fits. Keep context tighter than you think you need. Turn parallelism down on a personal box. Use Flash Attention and lighter KV cache types when the tradeoff makes sense. Check ollama ps or the equivalent before you trust your assumptions. Once you do that, local AI performance starts feeling predictable again, and local inference becomes a tool you can actually rely on.



