How to choose the right local LLM for 8GB, 12GB, and 24GB VRAM
Find the best local LLM for limited hardware, from 8GB laptops to 24GB GPUs, with practical advice on context, quantization, and fit.

Running a local model sounds wonderfully simple. One box. One model. No API bill. No usage cap. No surprise account lockout.
Then the real world shows up.
The model that looked unbeatable on a leaderboard starts crawling at four tokens per second. It spills out of VRAM into system RAM. Long context turns sluggish. Vision support disappears because your runtime does not load the right multimodal pieces. That is why threads like this LocalLLaMA discussion about the best 24GB model in 2026 keep resurfacing. People are not really asking for the “best” model. They are asking which model will stay fast, sharp, and usable on the hardware they already own.
That confusion is understandable because the menu has grown fast. Google’s Gemma 3 launch post laid out a family that ranges from 1B to 27B, with multimodal support in the larger sizes. Qwen3.5 landed in a rapid series of releases across multiple sizes and mixture-of-experts variants. Mistral added another major option with Mistral Small 4, while earlier lines like Mistral Small 3.1 and Devstral Small stayed relevant for people trying to make one GPU do real work every day.
More choice should make life easier. In local AI, it often does the opposite.
The core mistake is mixing up “good model” with “good fit.” A model can be excellent on paper and still be miserable on your machine. So the right way to choose a local model in 2026 is to start with the task, then match it to your VRAM, then leave enough headroom for context, cache, and runtime overhead.
That last part matters more than most people expect.
Why local model choices keep going wrong
Most bad local-model decisions come from a handful of recurring assumptions.
The first is benchmark worship. Benchmarks can be useful, but they do not tell you how a model will feel in your runtime, at your quantization level, on your context length, and on your GPU. The Mistral Small 3.1 model card makes that pretty clear by showing long-context results that differ sharply from other models in the same general class. A big context number on the front of the box does not guarantee the same quality at the far edge of that window.
The second mistake is believing active parameters tell the whole story. They do not. The Hugging Face explainer on mixture-of-experts models is still the cleanest explanation of why people get burned here. Sparse MoEs can act like smaller models per token, but they still need the experts resident in memory. That is why a model with modest active compute can still demand workstation-class VRAM.
The third mistake is treating long context like a free upgrade. It is never free. Ollama says as much in its context-length documentation. Bigger context means bigger memory use, which is why the platform defaults are conservative below higher VRAM tiers. That is also why a model that “supports” 128K or 256K context can feel completely different once you try to run it on a 12GB or 24GB card in the real world.
The fourth mistake is underestimating vision. Vision models are not just chat models with pictures taped on. The llama.cpp CLI docs expose that directly with options like --mmproj and --no-mmproj-offload, which tell you there is extra multimodal plumbing involved. Qwen’s own Qwen3.5 README also spells out that runtime support differs by platform and modality. If your stack is behind, your conclusion about a model may really be a conclusion about your tooling.
The fifth mistake is assuming runtimes are interchangeable. They are not. A model that behaves beautifully in one stack can be clumsy in another because of how the checkpoint is loaded, how experts are split, how attention is implemented, or how quantization is handled.
That is where a lot of user frustration comes from. And honestly, a lot of that frustration is deserved.
What actually fills memory on a local LLM setup
Every local setup has three big memory buckets: weights, KV cache, and runtime overhead.
The weights are the checkpoint itself. That is the part most people focus on first. Quantization helps because it shrinks those weights. Ollama’s FAQ on quantization gives the simple mental model many users need: q8_0 is about half the memory of f16, while q4_0 is roughly a quarter, with bigger tradeoffs as you push precision lower. That is a useful rule of thumb, but only as a starting point.
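To see what that rule of thumb means in gigabytes, here is a minimal sketch. The bytes-per-parameter figures are approximations derived from the GGUF block layouts (q8_0 and q4_0 store a small scale per 32-weight block), and real checkpoint files add metadata and per-tensor overhead, so treat the results as ballpark numbers only.

```python
# Rough weight-memory estimate from the f16 baseline, following the
# rule of thumb above: q8_0 is about half of f16, q4_0 roughly a quarter.
# Bytes-per-parameter values are approximate; real files carry extra
# metadata, so these are ballpark figures, not exact download sizes.

BYTES_PER_PARAM = {
    "f16": 2.0,       # 16-bit weights
    "q8_0": 1.0625,   # 8-bit weights plus one scale per 32-weight block
    "q4_0": 0.5625,   # 4-bit weights plus one scale per 32-weight block
}

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight memory in GB for a dense model."""
    return params_billion * BYTES_PER_PARAM[quant]

for quant in ("f16", "q8_0", "q4_0"):
    print(f"12B at {quant}: ~{weight_gb(12, quant):.1f} GB")
```

Run that for a 12B model and the f16-to-q4_0 drop is roughly 24 GB down to under 7 GB, which is exactly why quantization is the difference between "impossible" and "comfortable" on small cards.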
The KV cache is the second bucket, and it is the one that surprises people. Hugging Face’s cache explanation and its separate KV cache documentation both make the same point from slightly different angles. Caching saves compute because the model does not recompute attention from scratch every step. But the cache itself can become a serious memory bottleneck, especially as context grows. That is why quantized caches exist, and also why quantized caches are not always a free lunch.
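A back-of-envelope calculation makes the cache surprise concrete. The architecture numbers below are illustrative assumptions for a hypothetical ~12B-class model with grouped-query attention (40 layers, 8 KV heads, head dimension 128, f16 cache), not any specific model's real config, but the shape of the formula is general: keys plus values, per layer, per head, per position.

```python
# Back-of-envelope KV cache size for a hypothetical ~12B-class model.
# The 2x at the front covers keys AND values; everything scales
# linearly with context length, which is the whole problem.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """Cache memory in GB: K and V for every layer, head, and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context / 1e9

# Assumed dims: 40 layers, 8 KV heads, head_dim 128, f16 cache.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(40, 8, 128, ctx):.1f} GB of cache")
```

Under those assumptions, 8K of context costs about 1.3 GB of cache, but 128K costs over 21 GB, before the weights or the runtime take a single byte. That is why a context number that looks harmless on a model card can single-handedly evict a 12GB card from the game.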
Then there is runtime overhead. That includes what your inference stack needs in order to schedule work, manage tensors, keep multimodal components loaded, and avoid constant paging to CPU memory. This is also where “fits in VRAM” can turn into “technically loaded, practically unusable.”
That distinction matters a lot. If the weights load but the runtime has to squeeze context, quantize cache aggressively, or spill enough data to CPU that throughput collapses, your model does not really fit in any way that matters.
Checkpoint size tells you more than marketing does
One of the most useful clues in local AI is also one of the least glamorous: raw checkpoint size.
Look at the full repositories for Qwen3.5-27B and Qwen3.5-122B-A10B. Those file sizes tell you far more about likely hardware demands than the active-parameter headline alone. The same logic applies to models like Mistral Small 4, where the frontier appeal is obvious but the hardware implications are just as obvious once you look at the actual checkpoint.
This is also why runtime implementation can make or break a model choice. A vLLM issue report on Qwen3-Coder-Next FP8 quantization shows how the “same model” can behave very differently depending on the checkpoint path and how the runtime splits expert weights. In practice, that can turn a model from unrealistic to practical, or the other way around.
So when people ask what should run on a given card, they should stop staring at one line on the model card and start asking a more grounded question: how large is the checkpoint, how aggressive is the quantization, how big is the context, and how much headroom does the runtime need?
That is the real hardware story.
What to run on 8GB VRAM
At 8GB, the goal is not to squeeze out every last benchmark point. The goal is to stay inside the fast lane and avoid self-inflicted pain.
For general chat, the Qwen 3 tags in Ollama show why smaller Qwen variants remain attractive. They land in the kind of package sizes that make sense on this tier. For multimodal users, the Gemma 3 12B blob page in Ollama is also a useful reminder that even when a model can be squeezed onto limited hardware through quantization, that does not mean it is the easiest or happiest starting point.
This is where many people waste time. They try to bully mid-size models into small cards, then conclude local AI is overrated. Usually the better move is to choose a smaller model that stays fully GPU-resident, responds quickly, and gives you enough headroom for a sane context length.
For coding, 8GB is still the compromise tier. You can do patch-sized help, code explanation, targeted edits, and smaller refactors. What you should not expect is effortless repo-scale agent behavior from heavyweight coding models. The existence of Devstral Small and larger coding-first Qwen variants is exciting, but it also underlines the limit. Serious open coding models are trending upward in size, and 8GB users need to be more selective than ever.
In practical terms, 8GB is where discipline wins. Pick a competent small model. Keep context conservative. Skip oversized ambition.
What to run on 12GB VRAM
Twelve gigabytes is the first tier where local AI starts feeling useful on a daily basis instead of constantly precarious.
This is where 12B to low-teens models become genuinely interesting. The Gemma 3 12B Ollama artifact page makes clear why the model keeps coming up in these conversations. It is close enough to fit that a careful user can make it work, provided they do not act like the advertised maximum context is a free entitlement.
That is the main trap at 12GB. Users see 128K or 256K on a model card and assume they should dial it up immediately. Ollama’s context rules tell you why that is a bad instinct. Below 24GB of VRAM, the defaults stay much lower for a reason. Cache pressure climbs fast, and once CPU spill starts eating into responsiveness, the setup stops feeling local and starts feeling broken.
If your task mix is general chat, light reasoning, document work, and moderate coding, 12GB can now be very workable. What it cannot do gracefully is pretend to be a long-context workstation. Respect the ceiling and it will reward you.
Ignore the ceiling and you will end up benchmarking misery.
What to run on 24GB VRAM
Twenty-four gigabytes is still the sweet spot. This is the tier where local AI stops being a hobby demo and starts becoming a serious daily-driver setup. A Hugging Face discussion on Gemma 3 27B IT is especially useful here because it translates the abstract model story into the concrete language users care about: a single RTX 3090-class card, an INT4 path, and enough space left for cache and runtime overhead. That is the kind of evidence that helps people choose wisely.
At this tier, strong general-purpose models become practical rather than merely tempting. The Mistral Small 3.1 card, the Qwen 3 tags page, and the same Gemma 3 27B discussion all point toward the same center of gravity. This is where dense 24B to 30B-class models and carefully packaged sparse options start making sense for real use.
For coding, 24GB is where you can stop living entirely in compromise mode. It opens the door to models like Devstral Small and larger Qwen coding variants in a way that feels practical, not heroic. For vision, it is enough to experiment seriously, though you still need to respect multimodal overhead and context costs.
If you want one sentence to guide buying decisions in 2026, here it is: 24GB is where local AI gets comfortable enough to be worth building around.

What to run on 48GB VRAM
Forty-eight gigabytes buys something people undervalue until they have it: headroom.
Headroom means longer context without panic. It means retrieval-heavy workflows feel less fragile. It means you are not operating right on the cliff edge every time you add documents, images, or a bigger cache. Ollama’s context defaults reflect that shift directly. Once you move into this tier, bigger windows stop being theoretical marketing and start becoming usable settings.
This is also where larger sparse models begin to feel sane. The Qwen3.5-35B-A3B license page matters here for a different reason: hardware fit is only one decision axis. The license can shape the long-term value of a model just as much as raw performance, especially for people building products or internal tooling.
Forty-eight gigabytes is also the first tier where frontier experiments stop looking ridiculous. Models like Mistral Small 4 are still demanding, and nobody should mistake them for casual desktop picks. But on 48GB, at least the conversation changes from “absolutely not” to “maybe, with the right checkpoint, quantization, and expectations.”
That is a meaningful shift.
What to run on 128GB VRAM
One hundred and twenty-eight gigabytes is where big local starts to get interesting, but it still does not mean infinite freedom.
This tier finally puts high-end sparse models on the table in a way that does not immediately collapse into absurd offload. The checkpoint scale of Qwen3.5-122B-A10B makes that obvious. So does the hardware profile suggested by Mistral Small 4. You are now shopping in the 100B-class conversation, but you are still shopping carefully.
That matters because “128GB” can sound like the end of all constraints. It is not. It is the point where the biggest sparse local models begin to feel practical without clownish levels of compromise. That is a huge step up. It just is not magic.
For most readers, this section is less about what they should buy and more about perspective. If 24GB is the serious sweet spot, 128GB is the beginning of the workstation frontier.
Why licensing belongs in the buying decision
People love to talk about weights, context, and tokens per second. They spend less time on licensing, which is a mistake.
The Qwen3.5-35B-A3B license file matters because it reminds readers that not every “open” local option gives you the same freedoms. The same is true across other model families. Hardware fit is only one part of the decision. Legal fit matters too.
If two models are close in quality, the less restrictive license can easily become the smarter long-term choice. That is especially true for commercial teams, internal enterprise deployments, and anyone who does not want to redesign their stack around licensing surprises later.
A model that runs well but creates legal friction is not really the best model for your workflow.
How to choose the right local model without wasting weeks
The best process is still very simple.
Start with the task, not the leaderboard. If your main job is chat and research, favor balanced instruct models. If your main job is coding, use coding-first models. If your main job is vision, make sure the runtime you actually use supports multimodal inference properly. The Qwen3.5 README is useful here because it connects model capability to platform-specific tooling instead of pretending support is universal.
Then choose the smallest model that fully fits on GPU with headroom. In Ollama, the quickest sanity check is ollama ps. That command shows what is running, how much VRAM is in play, and whether you are quietly spilling onto CPU. If a supposedly “great” model is half-offloaded and dragging, you do not need a bigger download. You need a smaller target.
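If you want to automate that spill check, the PROCESSOR column of ollama ps is the field to watch. The exact format below is an assumption based on current Ollama output (values like "100% GPU" or "48%/52% CPU/GPU") and may change between versions, so treat this helper as a sketch rather than a stable interface.

```python
# A small helper for eyeballing the PROCESSOR column of `ollama ps`.
# Assumed formats (may change between Ollama versions):
#   "100% GPU"         -> fully GPU-resident, the fast path
#   "48%/52% CPU/GPU"  -> partially offloaded, expect a crawl
#   "100% CPU"         -> not on the GPU at all

def fully_on_gpu(processor_field: str) -> bool:
    """True only when the whole model is GPU-resident."""
    return processor_field.strip() == "100% GPU"

print(fully_on_gpu("100% GPU"))         # True
print(fully_on_gpu("48%/52% CPU/GPU"))  # False
```

Anything other than a fully-GPU answer means part of the model is living in system RAM, and that is usually the moment to reach for a smaller model rather than a bigger card.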
Keep context conservative until the setup proves itself. Raise it only after the model is fully GPU-resident, responsive, and sharp enough for the work you care about. This is where a lot of users should slow down. Bigger numbers feel good in screenshots. Smooth systems feel better in real life.
When official or pre-quantized checkpoints exist, prefer them. Ollama’s model import docs make that path clearer, and they also show the tooling for direct quantization through commands like ollama create --quantize. For large sparse models in particular, choosing the clean checkpoint path can save a lot of needless pain.
Finally, treat maximum context claims as upper bounds. The architecture may support a huge window. Your machine may not support that window gracefully. Those are different statements, and readers who keep them separate will make much better decisions.
A practical way to think about VRAM in 2026
The cleanest mental model is still a tiered one.
Eight gigabytes is for competent small models and disciplined expectations. Twelve gigabytes is where daily local chat becomes genuinely useful. Twenty-four gigabytes is the real sweet spot for people who want a single machine to handle serious local work. Forty-eight gigabytes buys comfort, longer-context breathing room, and room for more ambitious workflows. One hundred and twenty-eight gigabytes is where the biggest sparse local models start becoming realistic without ridiculous compromise.
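The tiered framework above can be written down as a simple lookup. The tier summaries just restate the article's framing in shorthand; the helper picks the highest tier a given card meets or exceeds.

```python
# The VRAM tiers above as a lookup table. Summaries are shorthand
# restatements of the article's framing, not new recommendations.

TIERS = [
    (8,   "competent small models, disciplined context"),
    (12,  "daily local chat becomes genuinely useful"),
    (24,  "the sweet spot for serious single-GPU work"),
    (48,  "headroom: longer context, ambitious workflows"),
    (128, "the biggest sparse local models become realistic"),
]

def tier_for(vram_gb: float) -> str:
    """Return the guidance for the highest tier this card reaches."""
    best = "below 8GB: stick to the smallest models"
    for floor, advice in TIERS:
        if vram_gb >= floor:
            best = advice
    return best

print(tier_for(24))  # prints: the sweet spot for serious single-GPU work
```

A 10GB card lands in the 8GB tier, a 16GB card in the 12GB tier, and so on: the framework rounds down on purpose, because headroom is the whole lesson of this article.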
That framework is not flashy, but it is much closer to reality than the usual hype cycle.
A local model is only “right” when it fits four tests at once. It has to fit in memory, stay fast enough to use, stay sharp enough at your chosen context, and fit the license and runtime constraints of your actual workflow. Miss any one of those and the recommendation starts falling apart.
That is why this question never goes away. People think they are asking for the best model. What they really need is the best fit.
Further reading
For readers who want to go deeper into the technical plumbing, the llama.cpp CLI reference is one of the clearest places to see how multimodal options surface at runtime. The Hugging Face MoE explainer, the cache explanation, and the KV cache guide are all useful background if you want to understand why active parameters, context length, and memory use so often tell different stories.
If you are troubleshooting packaging and runtime behavior rather than model quality, the vLLM Qwen3-Coder-Next issue, the Ollama FAQ, and the Ollama context docs are good places to start. They make a simple point that most local AI guides still underplay: the checkpoint, the runtime, and the hardware tier all shape the answer.
Conclusion
The local-model question keeps coming back because it is really four questions hiding inside one. What fits. What stays fast. What stays sharp at your chosen context. What you are actually free to use.
Those answers overlap, but they are never identical.
In 2026, the clearest answer is still this: choose by task first, then by VRAM. Verify that the model is actually staying on the GPU. Be conservative with context until the setup proves itself. Respect the extra cost of vision. Do not let MoE marketing convince you that active parameters tell the whole memory story.
Do that, and choosing a local model gets much less confusing.
Maybe not simple. But finally manageable.