The 24GB VRAM threshold is accurate. I ran Qwen 32B locally on an M4 Mac Mini and it changes the conversation entirely; the M4 handles it better than expected, honestly, given there's no dedicated GPU. The interesting comparison is no longer local versus API, it's which tasks justify the latency tradeoff.

I benchmarked a few current models, including a 397B model run locally by streaming weights from SSD. The gap between hosted and local has closed faster than I would have predicted even six months ago, and budget setups are suddenly viable for real work.
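
For anyone curious what this looks like in practice, here is a minimal sketch using llama-cpp-python. The model filename and quant level are illustrative, not necessarily what I ran; any GGUF works the same way:

    # Minimal local-inference benchmark sketch (llama-cpp-python).
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen2.5-32b-instruct-q4_k_m.gguf",  # hypothetical filename
        n_gpu_layers=-1,   # offload everything the backend can take (Metal on the M4)
        use_mmap=True,     # weights are memory-mapped; layers that don't fit
                           # in RAM get paged in from SSD on demand
        n_ctx=4096,
    )

    prompt = "Summarize the tradeoffs of local vs hosted LLM inference."
    start = time.perf_counter()
    out = llm(prompt, max_tokens=256)
    elapsed = time.perf_counter() - start

    n_tokens = out["usage"]["completion_tokens"]
    print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")

The use_mmap path is essentially what SSD streaming amounts to: the OS pages weights in as layers are touched, so a model larger than RAM still runs, just slowly. That slowdown is exactly the latency tradeoff I mean.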