I told GaryExplains over on YouTube that I would try to make time to benchmark Qwen 3.6 27b for feature work this week.
Well, I did find some time, and the first results are in. I did some work on document collaboration systems using the Pi agent, following my standard agentic workflow, the same one I use for Claude Code and Codex: plan, implement, demonstrate.
Agents always follow red-green test-driven development practices, which I find helps them stay focused and produce higher-quality work.
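For readers unfamiliar with the cadence, here is a minimal sketch of a single red-green cycle in TypeScript. The wordCount function and its test are hypothetical illustrations, not code from the actual sprints.

```typescript
// wordCount.test.ts — Red: the failing test is written first and run with
// `node --test` (via tsx for TypeScript sources).
import test from "node:test";
import assert from "node:assert/strict";
import { wordCount } from "./wordCount";

test("counts words separated by arbitrary whitespace", () => {
  assert.equal(wordCount("hello   world\nfoo"), 3);
  assert.equal(wordCount("   "), 0);
});

// wordCount.ts — Green: the minimal implementation that makes the test pass.
// Only then may the agent refactor and move on to the next item.
export function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}
```

The point of the discipline is that the agent must watch a test fail before writing the code that makes it pass, which keeps it honest about what is actually implemented.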
The model proved capable of developing detailed sprint plans based on either specifications or simple requirements, both in an existing codebase and a new project.
The model is also capable of following sprint plans while implementing a workable set of unit, integration and E2E tests. Some failures slipped through the TDD cycles but were caught in the demonstration phase, where the model is forced to run its own code to demonstrate each feature. A few significant items were picked up in code review, including a memory leak.
The three sprints each contained 30–50 tasks and ran for 23 to 57 minutes. The agent was instructed to work on items one at a time, commit regularly and always follow TDD. I saw no reasoning loops, no stubs or excuses, no interrupted runs and, perhaps most impressively, not a single malformed tool call. Completed milestones included building a prototype text editor in TypeScript with a Node backend and a complex data model, adding several features and improvements to said prototype, and finally adding generative AI features to an existing codebase.
The prototype sprints resulted in 2,131 lines of code and 53 tests, including 33 Playwright workflows, consuming just over 10 million tokens in the process. That is fewer unit and integration tests than I am used to seeing from the frontier models, but perhaps more E2E tests, and the result was a functional, demo-able prototype. None of the feature code was of good enough quality to be accepted into a proper application, but then AI-generated features rarely, if ever, are.
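To give a sense of what those Playwright workflows look like, here is a hedged sketch in the same style. The URL, roles and labels are hypothetical stand-ins for the prototype editor, not the agent's actual output.

```typescript
// e2e/editing.spec.ts — a typical end-to-end workflow test with Playwright:
// drive the real UI in a browser and assert on what the user would see.
import { test, expect } from "@playwright/test";

test("user can create a document and see it persisted", async ({ page }) => {
  await page.goto("http://localhost:3000"); // hypothetical dev-server URL

  // Create a new document through the UI, as a user would.
  await page.getByRole("button", { name: "New document" }).click();
  await page.getByRole("textbox", { name: "Title" }).fill("Meeting notes");
  await page.getByRole("button", { name: "Save" }).click();

  // Reload and verify the document survived the round trip to the backend.
  await page.reload();
  await expect(page.getByText("Meeting notes")).toBeVisible();
});
```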
After sprint completion the model was able to submit merge requests to GitLab and follow up on guidance from automatic code review (GitLab Duo Agentic Platform). However, it was noticeably less critical of unimportant or irrelevant suggestions from the DAP code review agent than the frontier models are.
Setup: Qwen 3.6 27b 8-bit MLX (mlx-community) using LM Studio's MLX M5 runtime on a 16-inch MacBook Pro M5 Max 18/40 with 128 GB.
Model config: temperature 0.1, top-K sampling 100, repeat penalty 1.1.
Context: the model's context window was set to the maximum of 256k at full precision; however, Pi was configured to use only 128k.
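For reference, the same sampling settings can also be passed per request. Here is a minimal sketch against LM Studio's OpenAI-compatible server (default http://localhost:1234/v1), assuming your build accepts top_k and repeat_penalty as extensions to the OpenAI schema; if not, set them in the model's sampling settings in the LM Studio UI. The model identifier is a hypothetical placeholder.

```typescript
// Query the locally served model with the sampling settings used above.
// top_k and repeat_penalty are not part of the standard OpenAI schema;
// verify your LM Studio version accepts them as request-level extensions.
const response = await fetch("http://localhost:1234/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen-3.6-27b-8bit-mlx", // hypothetical id; use what LM Studio reports
    messages: [{ role: "user", content: "Summarize the sprint plan." }],
    temperature: 0.1,
    top_k: 100,
    repeat_penalty: 1.1,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);
```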
So there we are, the state of the art: today's strongest open-weight model for coding, running on the very best commercially available laptop (for this kind of work, anyway), can in fact create usable prototypes or feature demos at some 15 tokens per second while air-gapped from the Internet or sitting on an airplane halfway across the Atlantic.
In conclusion, I still lean towards agreeing with Gary in principle. While you can do proper coding work on local models, it probably only makes sense if you cannot use a frontier model, for whatever reason. Claude Code will generate at 50–100 tokens per second and provide higher quality on average. The $8,000 for such a MacBook (or $10,000 if you happen to live here in Norway) will buy you a long time on Claude Max x20, roughly 40 months at $200 per month, even if you add the cost of a more typical laptop.
Setting up a shared system built around RTX Pro 6000 cards should offset both the performance and the cost gap, however. If we consider only those parameters, $20,000 or even $30,000 for a workstation with two cards sounds quite workable to support an entire product team at 50+ tokens per second per developer. Certainly for any work that cannot be delegated to a cloud-hosted model.
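As a rough back-of-the-envelope check on those numbers (the subscription price and team size below are my assumptions for illustration, not figures from the benchmark):

```typescript
// Back-of-the-envelope cost comparison, all figures in USD.
const macbook = 8_000;          // US price of the MacBook Pro used here
const claudeMaxPerMonth = 200;  // assumed Claude Max x20 subscription price
const workstation = 30_000;     // upper estimate, two RTX Pro 6000 cards
const teamSize = 10;            // assumed product team sharing the box

// Months of Claude Max x20 the laptop budget would otherwise buy.
console.log(macbook / claudeMaxPerMonth); // 40 months

// Hardware cost per developer for the shared workstation.
console.log(workstation / teamSize); // 3000, i.e. $3,000 per developer
```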