Qwen 3.5 vs the Desk Test: Why Local Coding Agents Still Fail

Why Qwen 3.5 looks strong in evals but breaks on your desk. A practical read on llama.cpp, tool calling, and local agent reliability.

Mar 21, 2026

Qwen 3.5 can score well on benchmarks, yet local coding agents still fail when llama.cpp, tool parsers, streaming, and file edits collide. © Popular AI

Qwen 3.5 and other open models keep posting serious benchmark numbers, and that part is real. The trouble starts when people assume those scores will carry cleanly into a coding agent running on a local machine. In practice, a local agent is never just the model. It is the model plus the chat template, the reasoning parser, the tool parser, the streaming path, the SDK contract, and the file-edit workflow. When even one layer disagrees with the others, users do not experience a strong model with a minor glitch. They get an agent that sounds sure of itself, skips the actual tool call, and then behaves as if it edited the file anyway.

That gap between benchmark strength and desk-level usefulness is the real story here. The article’s starting point is FeatureBench, which shows how quickly performance can collapse when agentic coding tasks become more realistic. In that benchmark, an agent that scored 74.4% on SWE-bench resolved only 11.0% of the feature-level tasks. AgentNoiseBench pushes the same warning from another angle. Cleaner training and evaluation setups can flatter an agent that later struggles in the messier environments real users actually run.

That broader warning lines up almost perfectly with what surfaced in a LocalLLaMA thread about running Qwen3.5 as an agent. The original poster could chat with Qwen3.5-35B through llama-server, --jinja, and zeroclaw, yet tool use would break at random. The symptoms were ugly and familiar. 400 and 500 errors, skipped calls, malformed output, and file writes or edits that never really happened. The same user also reported a side effect worth noticing. Moving from Windows 11 to Pop!_OS on the same unusual RTX 3070 + 5060 Ti setup boosted generation speed by nearly 50%. After a move from build b8220 to b8239, things seemed better for that user, but the rest of the thread stayed cautious. Other replies still described 35B as sketchy, with skipped tool calls, failed file edits, and even odd output like </thinking> instead of </think>.

More about local LLMs

How to choose the right local LLM for 8GB, 12GB, and 24GB VRAM

Popular AI

Mar 15

Read full story

Why protocol strength matters as much as model strength

The first lesson is simple. Benchmark strength does not guarantee protocol strength.

According to Qwen’s own function-calling documentation, the model family supports tool calling, including multi-step and parallel tool calls, but tool use is supposed to be handled through the intended template or through Qwen-Agent. The same documentation warns that stopword-based prompting styles such as ReAct are a poor fit for reasoning models like Qwen3 because the model can emit those stopwords inside its own reasoning process. Once that happens, tool calling can break for reasons that have nothing to do with the underlying capability of the model.

That warning matters more than it first appears. For Qwen3.5-35B-A3B, the official model card includes explicit launch examples for tool use in SGLang and vLLM, and those examples enable the qwen3_coder tool parser. That is a strong hint that the surrounding stack has to meet the model on its own terms. If your runner assumes every tool-capable model speaks the same OpenAI-shaped dialect, you can drift off course before the first file read, write, or edit ever starts.

Qwen’s tool format is more demanding than it looks

The second lesson is that Qwen’s tool path is not trivial plumbing. The current Qwen3.5 chat template uses explicit XML-style tool calls and also allows optional natural-language reasoning before the function call. In other words, the model can emit structure that looks more like this than like a plain JSON blob:

<tool_call>
  <function=write_file>
    <parameter=filePath>src/app.py</parameter>
    <parameter=content>...</parameter>
  </function>
</tool_call>

That matters because the parser on the other end has to recognize what the model actually emitted, not what the framework hoped it would emit. Qwen’s own Qwen3-Coder-Next model card says the coding-focused model was trained for real-world IDE and CLI agent use, with attention to executable tasks, environment interaction, and more diverse tool-call formats. That helps explain why one framework can look fine while another turns chaotic. The model may be flexible enough to cope with several tool styles, but the parser still has to understand the exact wrapper format that showed up on the wire.

Where llama.cpp became the trap

This is where llama.cpp turned into a trap for many Qwen 3.5 users.

One llama.cpp issue, #19872, documented a server-side 500 where the Qwen 3.5 template tried to iterate over tool_call.arguments|items, but the server had supplied a string instead of a mapping. Another issue, #20198, captured the reverse problem. llama-server was returning tool_calls[].function.arguments as a parsed JSON object even though OpenAI-compatible clients and the official OpenAI SDK expected a JSON string.

That may sound like a niche implementation detail, but it changes the user experience completely. The problem is no longer just model quality. The problem is that the model, template, and supposedly compatible server are disagreeing about what a valid tool call even looks like. Once that happens, users see a confident assistant that promises an edit and then leaves the file untouched.

A small example shows the kind of compatibility mismatch that matters here:

{
  "tool_calls": [
    {
      "function": {
        "name": "write_file",
        "arguments": "{\"filePath\":\"src/app.py\",\"content\":\"...\"}"
      }
    }
  ]
}

If the client expects the arguments field above to be a JSON string and the server sends an already parsed object instead, the whole chain can fail even when the model was otherwise on the right track.

Streaming, reasoning, and tools still make a volatile mix

The third lesson is that reasoning plus streaming plus tools is still a danger zone.

The vLLM Qwen3 reasoning parser notes explain that Qwen3.5 changed the chat template so <think> appears in the prompt and only </think> is generated. That sounds clean until tool calling enters the picture. As the article notes, a March 2026 llama.cpp issue described a failure mode where the model thought for a while, produced a short natural-language transition, and then emitted a valid <tool_call>. The grammar trigger recognized the tool call, but the later parser tried to parse the entire assistant message from the start. Because the message began with ordinary text instead of <tool_call>, parsing failed and the server streamed back a 500.

That lines up with a complaint local users keep repeating. It works when streaming is off. In many cases the model may have produced a perfectly usable call. The server simply choked on the wrapper state around it.

Great model scores do not guarantee a reliable local coding agent. Here is where Qwen 3.5, tool calls, and middleware go wrong. © Popular AI

Why file edits are where trust really breaks

File-edit failures fit the same pattern, and they hit much harder than a chat-only mistake because they break the one thing users actually opened the agent to do.

In llama.cpp issue #19382, users reported file-writing tool calls that duplicated filePath, broke JSON syntax, and looped on the same invalid request. The article also points to related cases where the model generated a natural-language preamble and then hit end-of-sequence before the actual tool call arrived. That is the desk-level failure people remember. The agent sounds competent, announces the action, and then never performs it.

That last point matters because conversational benchmarks can mask it. A model can already win the chat portion by explaining the right fix, describing the next step, or sounding persuasive while it reasons through the task. None of that counts if the tool boundary fails. For local coding agents, the real benchmark is whether the file changed correctly, whether the command actually ran, and whether the result survived the parser stack in between.

Share Popular AI

Why Qwen3-Coder-Next keeps coming up

This is also why Qwen3-Coder-Next kept surfacing in the thread as the better answer for local coding agents. The Reddit replies are still anecdotal, not a controlled benchmark, so they should be treated with caution. Even so, the direction makes sense.

Qwen positions Qwen3-Coder-Next as a model built specifically for coding agents and local development. Its description emphasizes executable tasks, environment interaction, and broader tool-format robustness. That does not make it magical, and it does not eliminate the open bug reports local users are still seeing. What it does suggest is that it is a more sensible default for agentic coding than assuming a benchmark-strong general model will automatically become a reliable desk agent inside whatever wrapper you already happen to use.

What users should actually do

The practical advice starts with the most boring fix. Update the runner first.

In the LocalLLaMA thread, moving from b8220 to b8239 helped immediately for the original poster. That matches the broader pattern described in the article, where releases around b8236 and b8239 included fixes tied to tool-call compatibility. Qwen’s guidance around llama.cpp also prefers recent releases while warning that the latest build can still contain bugs. For this model family, pinning a known-good build is often smarter than trusting whatever landed yesterday.

The next step is to use the parser the model actually expects. The official Qwen3.5-35B-A3B model card shows tool-use examples that enable qwen3_coder. That should be read as a requirement, not a suggestion. A generic OpenAI-compatible endpoint does not guarantee generic tool behavior. If your framework supports Qwen-Agent, that is even closer to the canonical path. The same idea applies to prompting style. Qwen’s docs warn against brittle stopword-based agent scaffolds for reasoning models, which is one reason ReAct-like setups can behave worse than expected even when the model itself looks strong.

Then reduce the failure surface while debugging. If streaming is on and tool calls keep breaking, turn streaming off for a test run. If reasoning text seems to interfere with tool boundaries, disable thinking where the stack supports it, such as enable_thinking=False. That move is only a debugging step. It removes one moving part so you can see whether the core tool path is healthy.

After that, inspect the wire format instead of trusting the transcript. Check whether tool_calls[].function.arguments is actually coming back as the JSON string your client expects. If you are seeing 500 errors mentioning items, template filter failures, or parser complaints around structured output, you are very likely dealing with a server-template compatibility problem rather than proof that Qwen cannot use tools. Verbose logs, rendered prompts, and raw responses will tell you more than the chat window ever will.

Finally, pick the model for the job instead of the screenshot. If your daily work is local coding with file reads, writes, and edits, Qwen3-Coder-Next is the safer bet right now than forcing Qwen3.5-35B into a stack that keeps skipping tool calls. If 35B keeps misbehaving on your hardware, the thread at least offers anecdotal support for trying a smaller Qwen3.5 variant or a different runner. The Pop!_OS speed bump should be read in the same practical way. Local compilation, backend choice, driver path, and packaging all matter. Big Linux gains are plausible without any mystery at all.

The real lesson for local agent builders

The deeper lesson is the one open-model enthusiasts sometimes learn the hard way. Open weights do not automatically give you a sovereign agent stack. The real control point is often the middleware sitting between the weights and the filesystem. That is where template defaults, parser bugs, SDK assumptions, and OpenAI-compatible shortcuts decide whether a local model is actually useful on your desk.

Benchmarks still matter. They tell you whether the model can reason, code, and solve structured tasks under cleaner conditions. But for people who want an agent that can open files, edit them, and leave behind correct changes, the benchmark screenshot is only the opening bid. The desk test is harsher and more honest. If the plumbing is bad, the benchmark win never reaches your files.

Explore more from Popular AI:

Start here | Local AI | Fixes & guides | Builds & gear | AI briefing

How to choose the right local LLM for 8GB, 12GB, and 24GB VRAM

Comments

Ready for more?