Two years ago, using an open-source LLM for serious work meant accepting a meaningful capability gap versus GPT-4 or Claude. That's no longer true. In 2026, MIT-licensed models like Kimi K2.5 and GLM-5 now approach proprietary frontier models on several coding and reasoning benchmarks. For teams with data privacy requirements, the need to fine-tune on their own data, or the desire to avoid recurring API costs, the open-source tier is now a viable primary choice, not just a fallback.
This guide covers the top 10 models from the Onyx Open LLM Leaderboard, updated as of March 12, 2026: eight open-source or open-weight models, plus two proprietary API models (Step-3.5-Flash and MiniMax M2.5) included as price-performance reference points.
How this guide is sourced: Licensing, benchmark, parameter, and API availability data comes from the Onyx Open LLM Leaderboard. The recommendations in each section are editorial guidance for teams comparing open-source and open-weight options.
TL;DR: Open-source LLMs have closed most of the gap with proprietary models for coding and reasoning tasks. Kimi K2.5 and GLM-5 are the two strongest picks: Kimi K2.5 leads on code generation and math under an MIT license, while GLM-5 is the best open-weight model for autonomously fixing real software bugs. For teams that need Apache 2.0 licensing, Qwen 3.5 leads on reasoning. If you want a cheap hosted API rather than self-hosting, DeepSeek V3.2 at $0.28/M is the best reference point. For teams that want to run a capable model on a single H100, GPT-oss 120B is the practical choice.
An open-source large language model is one whose weights are publicly available for download, so you can run it on your own hardware, fine-tune it on your own data, and deploy it without paying per-token API fees. The license determines what you can actually do with it commercially.
The most permissive licenses are MIT and Apache 2.0, which allow unrestricted commercial use. The Llama License (Meta) and Gemma License (Google) are open for most uses but have specific restrictions.
License types in this guide:
| License | Commercial Use | Fine-Tuning | Redistribution | Restrictions |
|---|---|---|---|---|
| MIT | Yes | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Yes | Attribution required |
| Llama License | Yes (under 700M users) | Yes | Yes | Requires Meta approval above threshold |
| Gemma License | Yes | Yes | Yes | Prohibits uses that harm Google products |
| Open Weight | Varies | Varies | Varies | Check per model |
| Model | Provider | License | Params (Total/Active) | SWE-bench | GPQA Diamond | AIME 2025 | HumanEval | Arena Elo |
|---|---|---|---|---|---|---|---|---|
| Kimi K2.5 | Moonshot | MIT | 1T / 32B | 76.8% | 87.6% | 96.1% | 99.0% | 1,447 |
| GLM-5 | Zhipu AI | MIT | 744B / 40B | 77.8% | 86.0% | 84.0% | 90.0% | 1,451 |
| GLM-4.7 | Zhipu AI | MIT | 355B / 32B | 73.8% | 85.7% | 95.7% | 94.2% | 1,445 |
| Qwen 3.5 | Qwen | Apache 2.0 | 397B / 17B | 76.4% | 88.4% | N/A | N/A | N/A |
| MiMo-V2-Flash | Xiaomi | MIT | 309B / 15B | 73.4% | 83.7% | 94.1% | 84.8% | 1,401 |
| DeepSeek V3.2 | DeepSeek | Open weight | 685B / 37B | 67.8% | 79.9% | 89.3% | N/A | 1,421 |
| Qwen 3 235B | Qwen | Apache 2.0 | 235B / 22B | N/A | 81.1% | 92.3% | N/A | 1,422 |
| Step-3.5-Flash | Stepfun | Proprietary API | 196B / 11B | 74.4% | N/A | 97.3% | 81.1% | N/A |
| MiniMax M2.5 | MiniMax | Proprietary API | 230B / 10B | 80.2% | 85.2% | 86.3% | 89.6% | N/A |
| GPT-oss 120B | OpenAI | Apache 2.0 | 117B / 5.1B | 62.4% | 80.9% | 97.9% | 88.3% | 1,354 |
Source: Onyx Open LLM Leaderboard, last updated March 12, 2026.
Best at fixing real bugs: GLM-5, Kimi K2.5, Qwen 3.5, GLM-4.7, MiMo-V2-Flash
Best code generation: Kimi K2.5, GLM-4.7, GLM-5, GPT-oss 120B, MiMo-V2-Flash
Best reasoning: Qwen 3.5, Kimi K2.5, GLM-5, GLM-4.7, MiMo-V2-Flash
Best math: GPT-oss 120B, Kimi K2.5, GLM-4.7, MiMo-V2-Flash, Qwen 3 235B
Low-cost API options: DeepSeek V3.2 at $0.28/M, Kimi K2.5 free API, Step-3.5-Flash at $0.10/M
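To make those per-token prices concrete, here is a quick sketch of monthly input-token cost at each listed rate. The prices come from the leaderboard snapshot above; the 500M-tokens-per-month volume is a hypothetical workload chosen for illustration, not a benchmark figure.

```python
# Rough monthly input-token cost at the listed leaderboard prices.
# Prices are $/million input tokens; output-token pricing (usually
# higher) is not included in this sketch.

PRICE_PER_M_INPUT = {
    "DeepSeek V3.2": 0.28,
    "Step-3.5-Flash": 0.10,
    "Kimi K2.5 (free API)": 0.00,
}

def monthly_cost(price_per_m: float, tokens_per_month: int) -> float:
    """Dollar cost for a given monthly input-token volume."""
    return price_per_m * tokens_per_month / 1_000_000

# Hypothetical workload: 500M input tokens per month.
for model, price in PRICE_PER_M_INPUT.items():
    print(f"{model}: ${monthly_cost(price, 500_000_000):,.2f}/month")
```

At that volume the spread is roughly $140/month for DeepSeek V3.2 versus $50/month for Step-3.5-Flash, which is small enough that licensing and data-handling terms, not price, usually decide between them.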
Facts: Kimi K2.5 is listed as MIT-licensed with 1T total / 32B active parameters, 76.8% SWE-bench, 87.6% GPQA Diamond, 96.1% AIME 2025, 99.0% HumanEval, and 1,447 Arena Elo.
Recommendation: Choose Kimi K2.5 when you want the strongest combination of code generation, math performance, and permissive licensing in one model.
Facts: GLM-5 is listed as MIT-licensed with 744B total / 40B active parameters, 77.8% SWE-bench, 86.0% GPQA Diamond, 84.0% AIME 2025, 90.0% HumanEval, and 1,451 Arena Elo.
Recommendation: Pick GLM-5 if your top priority is open-weight software-engineering performance rather than the best HumanEval or math score.
Facts: GLM-4.7 is listed as MIT-licensed with 355B total / 32B active parameters, 73.8% SWE-bench, 85.7% GPQA Diamond, 95.7% AIME 2025, 94.2% HumanEval, and 1,445 Arena Elo.
Recommendation: GLM-4.7 is the better fit than GLM-5 when you still want strong open-weight coding performance but need a lighter deployment footprint: at 355B total parameters it is less than half the size of GLM-5's 744B.
Facts: Qwen 3.5 is listed as Apache 2.0 licensed with 397B total / 17B active parameters, 76.4% SWE-bench, and 88.4% GPQA Diamond. HumanEval, AIME, and Arena Elo are not listed in this snapshot.
Recommendation: Use Qwen 3.5 when you need an Apache-licensed model with especially strong reasoning performance.
Facts: MiMo-V2-Flash is listed as MIT-licensed with 309B total / 15B active parameters, 73.4% SWE-bench, 83.7% GPQA Diamond, 94.1% AIME 2025, 84.8% HumanEval, and 1,401 Arena Elo.
Recommendation: MiMo-V2-Flash is a sensible option when you want solid math and coding performance under MIT without reaching for the largest models in this category.
Facts: DeepSeek V3.2 is listed with public weights, no standard open-source license, 685B total / 37B active parameters, 67.8% SWE-bench, 79.9% GPQA Diamond, 89.3% AIME 2025, 1,421 Arena Elo, and $0.28/M input pricing.
Recommendation: DeepSeek V3.2 is a cost-driven choice for teams comfortable reviewing non-standard licensing terms before production use.
Facts: Qwen 3 235B is listed as Apache 2.0 licensed with 235B total / 22B active parameters, 81.1% GPQA Diamond, 92.3% AIME 2025, and 1,422 Arena Elo. SWE-bench and HumanEval are not listed in this snapshot.
Recommendation: Choose Qwen 3 235B when you want Apache licensing and care more about reasoning benchmarks than coding-specific ones; SWE-bench and HumanEval are not listed for it in this snapshot.
Facts: Step-3.5-Flash is listed as a proprietary API model with 196B total / 11B active parameters, 74.4% SWE-bench, 97.3% AIME 2025, 81.1% HumanEval, and $0.10/M input pricing.
Recommendation: Step-3.5-Flash is useful as a budget benchmark in this comparison, but it is not the right pick if self-hosting or open-weight access is your actual requirement.
Facts: MiniMax M2.5 is listed as a proprietary API model with 230B total / 10B active parameters, 80.2% SWE-bench, 85.2% GPQA Diamond, 86.3% AIME 2025, 89.6% HumanEval, and $0.30/M input pricing.
Recommendation: MiniMax M2.5 is relevant here mainly as a price-performance reference point for teams deciding whether open-weight deployment is worth the tradeoff.
Facts: GPT-oss 120B is listed as Apache 2.0 licensed with 117B total / 5.1B active parameters, 62.4% SWE-bench, 80.9% GPQA Diamond, 97.9% AIME 2025, 88.3% HumanEval, and 1,354 Arena Elo.
Recommendation: GPT-oss 120B is the strongest option in this list when single-node self-hosting and Apache licensing matter more than reaching the top of SWE-bench.
| Use Case | Best Model | License | Key Score |
|---|---|---|---|
| Best coding | Kimi K2.5 | MIT | 99% HumanEval, 76.8% SWE-bench |
| Best reasoning | Qwen 3.5 | Apache 2.0 | 88.4% GPQA Diamond |
| Best math | GPT-oss 120B | Apache 2.0 | 97.9% AIME 2025 |
| Best at fixing real bugs | GLM-5 | MIT | 77.8% SWE-bench |
| Cheapest API with frontier scores | DeepSeek V3.2 | Open weight | $0.28/M input |
| Best single-H100 deployment | GPT-oss 120B | Apache 2.0 | 62.4% SWE-bench, 97.9% AIME |
| Best algorithmic tasks at low cost | Step-3.5-Flash | Proprietary API | $0.10/M, 74.4% SWE-bench |
| Best for fine-tuning (unrestricted) | Kimi K2.5 | MIT | Fully open weights |
Choosing an MIT or Apache 2.0 model is only part of the decision. Teams also need a way to compare hosted and self-hosted backends, connect those models to internal data, and preserve each user's access permissions during retrieval.
Onyx is useful in this context because it gives teams a common application layer on top of open-source models. You can test a hosted API against a self-hosted endpoint, connect the chosen model to sources like Slack, Confluence, Jira, Google Drive, and GitHub, and keep permission-aware retrieval in front of users. That makes it easier to act on the licensing and deployment tradeoffs in this guide instead of evaluating each model in isolation.
What is the best open-source LLM in 2026?
In this leaderboard snapshot, there is no single winner across every benchmark. Kimi K2.5 leads HumanEval (99%) and AIME (96.1%). GLM-5 has the strongest SWE-bench result among open models (77.8%). Qwen 3.5 has the highest GPQA Diamond score among the open-source models listed (88.4%). The best choice depends on whether you care most about coding, reasoning, licensing, or deployment constraints.
What is the difference between open-source and open-weight LLMs?
Open-source LLMs have publicly available weights, architecture, and (ideally) training code under a permissive license like MIT or Apache 2.0. Open-weight models release weights but may have proprietary training code or restrictive license terms. In practice, most "open-source" LLMs are open-weight: you can download and run them, but full source code and training data are rarely published.
Which open-source LLMs can I use commercially?
MIT and Apache 2.0 licensed models allow commercial use without restriction: Kimi K2.5, GLM-4.7, GLM-5, MiMo-V2-Flash, GPT-oss 120B, Qwen 3.5, Qwen 3 235B. DeepSeek V3.2 has a non-standard license that requires review for commercial deployments. Step-3.5-Flash and MiniMax M2.5 are proprietary API models. Always check the specific license terms for your deployment scenario.
Can I run these open-source models locally?
Most models in this list require enterprise hardware (4x H100 80GB or equivalent) for full-precision inference. More accessible options include GPT-oss 120B (1x H100) and the DeepSeek R1 distilled variants (DS-R1-Distill-Qwen-32B, DS-R1-Distill-Llama-70B) that run on a single RTX 4090 or H100. See the Best Self-Hosted LLMs 2026 guide for hardware requirements per model.
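A back-of-envelope way to sanity-check those hardware requirements is weight memory: total parameters times bytes per parameter. The parameter counts come from the comparison table above; the bytes-per-parameter figures are standard for FP16, FP8, and 4-bit quantization. Real deployments also need headroom for KV cache and activations, so treat these as lower bounds.

```python
# Back-of-envelope weight-memory estimate: total parameters x bytes
# per parameter. Excludes KV cache and activation memory, so these
# numbers are lower bounds on actual VRAM requirements.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(total_params_b: float, dtype: str) -> float:
    """Approximate weight memory in GB for a model with the given
    total parameter count (in billions) at the given precision."""
    return total_params_b * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

# GPT-oss 120B (117B total parameters per the table above):
for dtype in BYTES_PER_PARAM:
    print(f"117B @ {dtype}: ~{weight_gb(117, dtype):.0f} GB")
```

Under this estimate, GPT-oss 120B's 117B weights need roughly 234 GB at FP16 but only about 59 GB at 4-bit precision, which is why the single-H100 (80 GB) deployment figure assumes low-precision inference.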
What is the best platform for running open-source and proprietary LLMs together?
Most teams end up mixing models: a self-hosted open-weight model for sensitive data, a cheap API for high-volume tasks, and a frontier model for the hardest work. Onyx gives teams a single interface to connect all of these, routing tasks to the right model while keeping answers grounded in company knowledge from Slack, Confluence, Jira, Google Drive, and GitHub. It's MIT-licensed, supports self-hosted and API-based backends, and is free to get started.
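The mix-and-match pattern described here reduces to a small routing rule. The sketch below is purely illustrative and is not Onyx's actual configuration API: the `Task` fields, model identifiers, and routing logic are all hypothetical, chosen to mirror the three tiers named above.

```python
# Illustrative task router for a mixed open/proprietary model fleet.
# Model names echo this guide; the Task type and routing rules are
# hypothetical, not any platform's real API.

from dataclasses import dataclass

@dataclass
class Task:
    sensitive: bool      # touches private or internal data
    high_volume: bool    # cheap bulk work (summaries, tagging)

def route(task: Task) -> str:
    if task.sensitive:
        return "self-hosted/glm-5"    # open weights stay in-house
    if task.high_volume:
        return "api/deepseek-v3.2"    # low-cost hosted API
    return "api/frontier-model"       # hardest general work

print(route(Task(sensitive=True, high_volume=False)))   # self-hosted/glm-5
print(route(Task(sensitive=False, high_volume=True)))   # api/deepseek-v3.2
```

The point of the sketch is that routing decisions hinge on task attributes, not model quality alone, which is why a common application layer in front of several backends is useful.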
Related Insights
Best LLMs for Coding in 2026
Claude Opus 4.6 leads SWE-bench Verified at 80.8%. GPT-5.4 leads Terminal-Bench at 75.1%. Full benchmark breakdown for 10 coding LLMs with cost comparison and open-source picks.
Best LLMs in 2026
Compare leading large language models in Onyx's leaderboard snapshot. Benchmark data for Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Kimi K2.5, GLM-5, and more across SWE-bench, GPQA Diamond, and AIME.