1 Comment

JP

You have nailed the bit about agent benchmarks being a different sport. The gap between "can it chat" and "can it finish the job" is where most of my time goes now.

Practically speaking, I have been running Kimi K2.5 via Synthetic.new for agentic coding loops. It scores 76.8% on SWE-bench Verified, and the plan is a flat $30/month with 135 messages per 5-hour window. For the kind of multi-step tool-calling work you are describing, the cost model matters almost as much as the benchmark scores, because token anxiety changes how you use the tool: you hold back instead of letting it run.
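
To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. It uses only the figures above ($30/month, 135 messages per 5-hour window) and assumes a 30-day month with windows used back to back at full quota, which is an upper bound rather than how anyone actually works. The function name `flat_plan_ceiling` is made up for illustration.

```python
def flat_plan_ceiling(monthly_price_usd, messages_per_window, window_hours, days_in_month=30):
    """Return (max messages per month, lowest possible cost per message).

    Assumes every window is used back to back at full quota, so this is a
    theoretical ceiling, not a typical usage figure.
    """
    windows_per_month = (days_in_month * 24) // window_hours
    max_messages = windows_per_month * messages_per_window
    return max_messages, monthly_price_usd / max_messages


# Figures from the comment: $30/month, 135 messages per 5-hour window.
max_messages, floor_cost = flat_plan_ceiling(30.0, 135, 5)
print(f"ceiling: {max_messages} messages/month")  # ceiling: 19440 messages/month
print(f"floor: ${floor_cost:.4f} per message")
```

Real usage will land well below that ceiling, but the point stands either way: the bill does not move with how long a loop runs.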

I wrote up the full setup and cost comparison here: https://reading.sh/how-to-get-3x-claude-rate-limits-for-30-a-month-1d3fdb8658df

The lock-in argument in your closing section is the part most people miss, though. A working agent stack is way harder to move than a model.
