The model that writes the cleanest function isn't necessarily the best one for shipping features: generating code from a description and fixing bugs in a real codebase are different skills, and the best model at one is not always the best at the other. This guide covers the coding models in the Onyx Coding LLM Leaderboard, updated March 12, 2026. In that dataset, Claude Opus 4.6 leads on autonomous bug-fixing, GPT-5.4 leads on agentic terminal tasks, Kimi K2.5 and GLM-5 are the strongest open-source options, and MiniMax M2.5 or Step-3.5-Flash stand out when API cost matters.
How this guide is sourced: Coding benchmark, pricing, and license data comes from the Onyx Coding LLM Leaderboard. The recommendations in each review are editorial guidance based on that dataset.
TL;DR: For serious software engineering, Claude Opus 4.6 leads on autonomous bug-fixing with 80.8% SWE-bench. For agentic coding and terminal tasks, GPT-5.4 leads with 75.1% Terminal-Bench and 75% OSWorld-Verified, the strongest agentic profile in this dataset. If cost is the constraint, MiniMax M2.5 matches frontier-level bug-fixing at $0.30/M, and Step-3.5-Flash goes even lower at $0.10/M. For open-source, Kimi K2.5 leads on code generation under an MIT license, while GLM-5 is the strongest open-weight model for fixing real bugs.
Not all coding tasks are the same, and different models are optimized for different ones.
Writing a function from a description is easy for almost every modern LLM. Navigating an unfamiliar 50,000-line codebase, finding where a bug originates, and writing a fix that doesn't break three other things: that's much harder, and that's what separates the top models from the rest.
The four benchmarks in this guide each test something different:
- SWE-bench Verified: resolving real GitHub issues in full codebases, end to end.
- HumanEval: generating a correct standalone function from a description.
- LiveCodeBench: competitive-programming-style algorithmic problems.
- Terminal-Bench: completing agentic tasks in a terminal environment.
A model that excels on HumanEval but has a low SWE-bench score is a good autocomplete tool. A model that leads on SWE-bench and Terminal-Bench is what you want for serious engineering work.
| Model | Provider | SWE-bench | HumanEval | LiveCode | Terminal-Bench | API Cost (per 1M in/out) | License |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 80.8% | 95.0% | 76.0% | 65.4% | $15 / $75 | Proprietary |
| GPT-5.4 | OpenAI | N/A† | N/A | N/A | 75.1% | $2.50 / $15 | Proprietary |
| MiniMax M2.5 | MiniMax | 80.2% | 89.6% | 65.0% | 42.2% | $0.30 / $1.20 | API |
| Claude Sonnet 4.6 | Anthropic | 79.6% | 92.1% | 72.4% | 59.1% | $3 / $15 | Proprietary |
| Gemini 3.1 Pro | Google | 78.0% | 93.0% | 81.3% | 56.2% | $2 / $12 | Proprietary |
| GLM-5 | Zhipu AI | 77.8% | 90.0% | 52.0% | 56.2% | Free API | MIT |
| Kimi K2.5 | Moonshot | 76.8% | 99.0% | 85.0% | 50.8% | Free API | MIT |
| Qwen 3.5 | Qwen | 76.4% | N/A | 83.6% | 52.5% | Free API | Apache 2.0 |
| Step-3.5-Flash | Stepfun | 74.4% | 81.1% | 86.4% | 51.0% | $0.10 / $0.30 | API |
| MiMo-V2-Flash | Xiaomi | 73.4% | 84.8% | 80.6% | 38.5% | Free API | MIT |
† SWE-bench Verified not yet published for GPT-5.4. SWE-bench Pro = 57.7%.
Source: Onyx Coding LLM Leaderboard, last updated March 12, 2026.
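One way to read the pricing column is performance per dollar. A quick sketch using the table's published numbers, dividing each model's SWE-bench score by its output-token price (output price only; free-API rows excluded):

```python
# SWE-bench score per dollar of output tokens (per 1M), from the table above.
# Output price only; free-API models (Kimi K2.5, GLM-5, etc.) are excluded.
models = {
    "Claude Opus 4.6": (80.8, 75.00),
    "MiniMax M2.5": (80.2, 1.20),
    "Claude Sonnet 4.6": (79.6, 15.00),
    "Gemini 3.1 Pro": (78.0, 12.00),
    "Step-3.5-Flash": (74.4, 0.30),
}

# Rank by SWE-bench points per output-token dollar, best value first.
for name, (swe, out_price) in sorted(
    models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
):
    print(f"{name:18} {swe / out_price:8.1f} SWE-bench pts per $/M output")
```

On this crude metric, Step-3.5-Flash and MiniMax M2.5 dominate, which is why both appear in the budget recommendations below, while Claude Opus 4.6 is the pick only when accuracy outweighs cost entirely.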
Best at fixing real bugs: Claude Opus 4.6, MiniMax M2.5, Claude Sonnet 4.6, Gemini 3.1 Pro, GLM-5
Best code generation: Kimi K2.5, Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Claude Sonnet 4.6
Best for competitive programming: Step-3.5-Flash, Kimi K2.5, Qwen 3.5, Gemini 3.1 Pro, GPT-5.4
Best for agentic coding: GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, GLM-5, Gemini 3.1 Pro
Best reasoning for coding tasks: GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Claude Sonnet 4.6
Best open-source: GLM-5, Kimi K2.5, Qwen 3.5, MiMo-V2-Flash
Best budget API: MiniMax M2.5 at $0.30/M, Step-3.5-Flash at $0.10/M
Claude Opus 4.6 scores 80.8% on SWE-bench Verified, 95.0% on HumanEval, 76.0% on LiveCodeBench, 65.4% on Terminal-Bench, and costs $15 / $75 per 1M input/output tokens. Use it when you are optimizing for maximum success rate on difficult software-engineering tasks, particularly where autonomous bug-fixing quality is the top priority.
GPT-5.4 leads this dataset on agentic tasks: 75.1% Terminal-Bench, 75% OSWorld-Verified, and 82.7% BrowseComp. It also scores 92.8% on GPQA Diamond, showing strong reasoning alongside its agentic performance. SWE-bench Verified hasn't been published yet (SWE-bench Pro is 57.7%), and it costs $2.50 / $15 per 1M input/output tokens. Choose GPT-5.4 when agentic coding, terminal task automation, or computer control are the primary use case.
MiniMax M2.5 scores 80.2% on SWE-bench Verified, 89.6% on HumanEval, 65.0% on LiveCodeBench, 42.2% on Terminal-Bench, and costs $0.30 / $1.20 per 1M input/output tokens. It is the first model to evaluate when software-engineering throughput per dollar matters more than terminal-agent performance.
Claude Sonnet 4.6 scores 79.6% on SWE-bench Verified, 92.1% on HumanEval, 72.4% on LiveCodeBench, 59.1% on Terminal-Bench, and costs $3 / $15 per 1M input/output tokens. Pick it when you want Anthropic compatibility and strong coding performance without stepping up to Opus pricing.
Gemini 3.1 Pro scores 78.0% on SWE-bench Verified, 93.0% on HumanEval, 81.3% on LiveCodeBench, 56.2% on Terminal-Bench, and costs $2 / $12 per 1M input/output tokens. Start with Gemini 3.1 Pro when you want strong coding coverage and the lowest proprietary API price among frontier models, especially for code generation and algorithm-heavy tasks.
GLM-5 scores 77.8% on SWE-bench Verified, 90.0% on HumanEval, 52.0% on LiveCodeBench, 56.2% on Terminal-Bench, and is MIT-licensed with a free API option. It is the best fit in this ranking when you need MIT licensing and want a model optimized for practical software-engineering workflows.
Kimi K2.5 scores 76.8% on SWE-bench Verified, 99.0% on HumanEval, 85.0% on LiveCodeBench, 50.8% on Terminal-Bench, and is MIT-licensed with a free API option. Choose Kimi K2.5 when code generation and algorithmic problem solving matter more than terminal-agent performance.
Qwen 3.5 scores 76.4% on SWE-bench Verified, 83.6% on LiveCodeBench, 52.5% on Terminal-Bench, and is Apache 2.0 licensed with a free API option. HumanEval is not listed in this snapshot. Qwen 3.5 is the strongest option here when Apache licensing and competitive-programming-style coding are the key requirements.
Step-3.5-Flash scores 74.4% on SWE-bench Verified, 81.1% on HumanEval, 86.4% on LiveCodeBench, 51.0% on Terminal-Bench, and costs $0.10 / $0.30 per 1M input/output tokens. Use it when algorithmic coding volume is high and price sensitivity outweighs the need for the strongest end-to-end engineering model.
MiMo-V2-Flash scores 73.4% on SWE-bench Verified, 84.8% on HumanEval, 80.6% on LiveCodeBench, 38.5% on Terminal-Bench, and is MIT-licensed with a free API option. It is a balanced MIT-licensed option when you want broadly decent coding performance without targeting a specialized winner.
At 10 billion output tokens per month (roughly 200M tokens per developer for a 50-person engineering team making heavy use of AI coding assistance), output-token costs work out to:
| Model | Monthly API Cost | SWE-bench | Notes |
|---|---|---|---|
| Claude Opus 4.6 | ~$750,000 | 80.8% | Highest SWE-bench, highest cost |
| Claude Sonnet 4.6 | ~$150,000 | 79.6% | Best Anthropic value |
| GPT-5.4 | ~$150,000 | N/A† | Best Terminal-Bench, strongest agentic use |
| Gemini 3.1 Pro | ~$120,000 | 78.0% | Cheapest frontier proprietary model |
| MiniMax M2.5 | ~$12,000 | 80.2% | Best value in the frontier tier |
| Step-3.5-Flash | ~$3,000 | 74.4% | Cheapest option with 74%+ SWE-bench |
| Kimi K2.5 | Free in this snapshot | 76.8% | MIT licensed, listed with free API |
| GLM-5 | Free in this snapshot | 77.8% | MIT licensed, listed with free API |
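The monthly figures above are straightforward arithmetic: output tokens divided by one million, times the per-1M output price. A minimal sketch, using 10 billion output tokens and the output prices from the pricing table:

```python
def monthly_cost(output_tokens: int, price_per_million: float) -> float:
    """Monthly API spend from output-token volume and USD price per 1M tokens."""
    return output_tokens / 1_000_000 * price_per_million

# Output prices (USD per 1M tokens) from the comparison table above.
TOKENS = 10_000_000_000  # 10B output tokens per month
for name, price in [
    ("Claude Opus 4.6", 75.00),
    ("Claude Sonnet 4.6", 15.00),
    ("MiniMax M2.5", 1.20),
    ("Step-3.5-Flash", 0.30),
]:
    print(f"{name}: ${monthly_cost(TOKENS, price):,.0f}/month")
```

Input-token costs are omitted here for simplicity; for agentic workloads with large context windows, input tokens can add meaningfully to the bill.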
| Your Situation | Best Choice |
|---|---|
| Highest SWE-bench, cost not a constraint | Claude Opus 4.6 |
| Best agentic coding and terminal tasks | GPT-5.4 (75.1% Terminal-Bench, 75% OSWorld) |
| Best frontier performance per dollar | MiniMax M2.5 ($0.30/M input, 80.2% SWE-bench) |
| Cheapest API with high LiveCodeBench | Step-3.5-Flash ($0.10/M, 86.4% LiveCode) |
| Best open-source, MIT license | Kimi K2.5 (99% HumanEval) or GLM-5 (77.8% SWE-bench) |
| Apache 2.0, open weights | Qwen 3.5 (83.6% LiveCode) |
| Anthropic ecosystem, cost-optimized | Claude Sonnet 4.6 (79.6% SWE-bench, $15/M out) |
| Frontier coding at reasonable API cost | Gemini 3.1 Pro ($2/M) or Claude Sonnet 4.6 ($3/M) |
What is the best LLM for coding in 2026?
In this leaderboard snapshot, Claude Opus 4.6 has the strongest autonomous bug-fixing performance at 80.8% SWE-bench. GPT-5.4 leads on Terminal-Bench (75.1%) and OSWorld-Verified (75%), making it the best pick for agentic coding workflows. MiniMax M2.5 is notable for price-performance on SWE-bench. For open-source options, Kimi K2.5 and GLM-5 are the leading choices in this dataset.
What is SWE-bench Verified?
SWE-bench Verified is a benchmark where models are given real GitHub issues from popular Python projects and must autonomously write a code patch that makes the project's existing tests pass. Unlike HumanEval, which tests function-level code generation, SWE-bench tests the ability to understand entire codebases, locate relevant files, and write correct diffs. A score above 75% is considered frontier-level as of 2026.
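The pass/fail criterion can be illustrated with a toy analogue: run the candidate code, run the task's tests, and count the issue as resolved only if every test passes. This is a deliberately simplified sketch, not the official harness (which applies git diffs to full repositories and runs their real test suites); the `resolves_issue` helper and the `add` task are illustrative.

```python
def resolves_issue(patched_source: str, checks) -> bool:
    """Exec a candidate patch and run the task's checks.
    Pass/fail only, mirroring SWE-bench's resolution criterion in miniature."""
    namespace = {}
    try:
        exec(patched_source, namespace)
        return all(check(namespace) for check in checks)
    except Exception:
        return False  # code that crashes or fails to define the API is a miss

# Toy task: the repo's add() subtracts by mistake; the hidden tests expect addition.
checks = [lambda ns: ns["add"](2, 3) == 5, lambda ns: ns["add"](-1, 1) == 0]

buggy = "def add(a, b):\n    return a - b\n"
model_patch = "def add(a, b):\n    return a + b\n"

print(resolves_issue(buggy, checks))        # original code fails the tests
print(resolves_issue(model_patch, checks))  # patched code resolves the issue
```

A model's SWE-bench score is simply the fraction of issues resolved under this all-or-nothing criterion, which is why partial fixes and near-misses earn no credit.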
Is Claude better than GPT for coding?
On SWE-bench Verified, Claude Opus 4.6 (80.8%) leads among models with published scores. On Terminal-Bench, GPT-5.4 (75.1%) leads Claude Opus 4.6 (65.4%) by a significant margin, making it the better choice for autonomous coding agents and terminal workflows. The right pick depends on whether you are optimizing for bug-fixing accuracy or agentic task completion.
What is the best open-source model for coding?
In this leaderboard snapshot, Kimi K2.5 and GLM-5 are the strongest open-source coding models. Kimi K2.5 leads on code generation with 99% HumanEval and 85% LiveCodeBench under an MIT license. GLM-5 is the better pick for fixing real bugs, with 77.8% SWE-bench and strong terminal performance, also MIT-licensed. Qwen 3.5 is the top Apache 2.0 option if you need that specific license. All three offer free API access in this snapshot.
Related Insights
Best Open Source LLMs in 2026
Compare 10 open-source and open-weight language models in Onyx's March 12, 2026 leaderboard snapshot. Benchmark data, license types, and API availability for Kimi K2.5, GLM-5, DeepSeek V3.2, Qwen 3.5, GPT-oss 120B, and more.
Best LLMs in 2026
Compare leading large language models in Onyx's leaderboard snapshot. Benchmark data for Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Kimi K2.5, GLM-5, and more across SWE-bench, GPQA Diamond, and AIME.