GLM-5.1: The First Open-Weight Model to Lead SWE-Bench Pro

The benchmark that matters most for coding agents has a new leader — and it’s open-weight.

On April 8, Z.AI launched GLM-5.1, a Mixture-of-Experts model with 744B total parameters and 40B active per forward pass. It scored 58.4 on SWE-Bench Pro, surpassing GPT-5.4 (57.7) and Claude Opus 4.6 (57.3) and becoming the first open-weight system to lead that leaderboard. The weights are published under the MIT license and are available on Hugging Face.

Why this is more than just another benchmark announcement

Most “new #1 on benchmark X” launches are single-metric stories. GLM-5.1 is not. The full profile: 95.3 on AIME 2026, 86.2 on GPQA-Diamond, 68.7 on CyberGym (up from 48.3 for its predecessor, GLM-5), and 71.8 on MCP-Atlas. The model advances simultaneously in reasoning, coding, agents, tool use, and browsing. That breadth matters more than the SWE-Bench headline.

But the real engineering story is what Z.AI calls long-horizon autonomy. Previous models — including GLM-5 — hit a plateau: they apply known techniques for quick early gains, then stall. Giving them more time doesn’t help. GLM-5.1 is explicitly designed to break that pattern. It can sustain a complex engineering task for up to 8 hours, executing hundreds of tool calls and thousands of self-revision rounds without human intervention. The model revisits its reasoning, revises its strategy, and stays productive instead of drifting.

This matters for developers building autonomous agents. The difference between a model that hits its plateau in the first hour and one that keeps improving through the eighth hour isn’t just performance: it defines whether autonomous engineering tasks are actually feasible without constant human oversight.

The architecture behind sustained performance

GLM-5.1 runs on the glm_moe_dsa architecture: MoE combined with DSA (DeepSeek Sparse Attention). MoE activates only a subset of parameters per forward pass, which is why a 744B model can operate with the compute footprint of a much smaller dense model. On the training side, Z.AI implemented asynchronous reinforcement learning that decouples generation from training, allowing the model to learn effectively from long, complex interactions, the kind that single-turn RL struggles to handle.
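To make the "only a subset of parameters" point concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a toy illustration, not Z.AI's implementation: the eight scalar "experts", the router logits, and the choice of two active experts are all hypothetical values picked for readability. Only the 744B and 40B figures come from the announcement.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, experts, router_logits, k):
    """Weighted sum over the selected experts; the others are never called."""
    output = 0.0
    for index, weight in route_top_k(router_logits, k):
        output += weight * experts[index](x)
    return output

# Hypothetical toy setup: each "expert" is a scalar function that logs its use.
calls = []

def make_expert(i):
    def expert(x):
        calls.append(i)
        return (i + 1) * x
    return expert

NUM_EXPERTS, TOP_K = 8, 2  # assumed values, not GLM-5.1's real configuration
experts = [make_expert(i) for i in range(NUM_EXPERTS)]
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]

y = moe_forward(3.0, experts, router_logits, TOP_K)
print(sorted(calls))             # [1, 4]: only two of eight experts ran

# Share of GLM-5.1's parameters active per forward pass, from the stated figures.
active_fraction = 40 / 744
print(f"{active_fraction:.1%}")  # 5.4%
```

The router scores every expert but executes only the top two, so per-token compute scales with the active experts rather than the full expert pool. By the published figures, roughly 5.4% of GLM-5.1's parameters participate in any single forward pass.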

Practical reality of self-hosting

The MIT license and Hugging Face availability are real. But MoE models require specific serving infrastructure: this is not a model you spin up with a standard setup on conventional hardware. The 40B active parameters make inference compute tractable, but the full 744B parameters still have to be stored and served, and you need a serving stack that understands sparse expert routing. If your team is evaluating self-hosted deployment, budget time for infrastructure work beyond the download.
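A rough back-of-the-envelope estimate shows why self-hosting is an infrastructure project. The sketch below computes the memory needed just to hold the weights at a few common precisions. It rests on assumptions: that all experts stay resident in accelerator memory (the usual case unless you offload), a hypothetical 80 GB per device, and no allowance for KV cache, activations, or runtime overhead, so real requirements will be higher. Which precisions GLM-5.1 is actually distributed in is not stated here.

```python
import math

TOTAL_PARAMS = 744e9   # total parameters, from the announcement
ACTIVE_PARAMS = 40e9   # parameters active per forward pass

# Bytes per parameter at common precisions (illustrative, not a claim about
# which formats GLM-5.1 is actually published in).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def min_devices(total_gb, gb_per_device):
    """Lower bound on device count, ignoring KV cache and overhead."""
    return math.ceil(total_gb / gb_per_device)

GB_PER_DEVICE = 80  # assumed accelerator memory; adjust for your hardware

for precision, nbytes in BYTES_PER_PARAM.items():
    total_gb = weight_memory_gb(TOTAL_PARAMS, nbytes)
    active_gb = weight_memory_gb(ACTIVE_PARAMS, nbytes)
    devices = min_devices(total_gb, GB_PER_DEVICE)
    print(f"{precision}: {total_gb:.0f} GB total, "
          f"{active_gb:.0f} GB active per pass, >= {devices} devices")
```

At BF16 the weights alone come to about 1,488 GB, which is at least 19 devices of 80 GB each before any memory is set aside for serving. The active 40B parameters determine compute per token, but not how much memory the deployment needs.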

For most teams today, Z.AI’s API platform is the practical path to using GLM-5.1 in production.

The fact that shouldn’t get buried

GLM-5.1 was trained entirely on Huawei Ascend 910B chips — zero Nvidia hardware. For developers following AI infrastructure dynamics and supply chains, this is significant. It demonstrates that it’s possible to train state-of-the-art open-weight models outside the Nvidia ecosystem at scale. Whether that affects your stack decisions today is a separate question. But it signals something important about where open-weight model development could come from in the years ahead.

Conclusion

If you’re building coding agents or evaluating foundation models for long-duration autonomous tasks, GLM-5.1 is a serious option worth testing via API. Open weights under MIT make it viable for production use cases where model sovereignty matters. The infrastructure requirements for self-hosting are real — plan accordingly.

SWE-Bench Pro scores: GLM-5.1 (58.4) · GPT-5.4 (57.7) · Claude Opus 4.6 (57.3) · Qwen3.6-Plus (56.6) · Minimax M2.7 (56.2) · GLM-5 (55.1) · Gemini 3.1 Pro (54.2) · Kimi K2.5 (53.8)