
AI Tools · 11 min read · Published Mar 6, 2026

Best Open Source LLMs in 2026

By Roshan Desai

Two years ago, using an open-source LLM for serious work meant accepting a meaningful capability gap versus GPT-4 or Claude. That's no longer true. In 2026, MIT-licensed models like Kimi K2.5 and GLM-5 now approach proprietary frontier models on several coding and reasoning benchmarks. For teams with data privacy requirements, the need to fine-tune on their own data, or the desire to avoid recurring API costs, the open-source tier is now a viable primary choice, not just a fallback.

This guide covers the top 10 models from the Onyx Open LLM Leaderboard as of March 12, 2026: mostly open-source and open-weight releases, plus two proprietary API models included as reference points.

How this guide is sourced: Licensing, benchmark, parameter, and API availability data comes from the Onyx Open LLM Leaderboard. The recommendations in each section are editorial guidance for teams comparing open-source and open-weight options.


TL;DR: Open-source LLMs have closed most of the gap with proprietary models for coding and reasoning tasks. Kimi K2.5 and GLM-5 are the two strongest picks: Kimi K2.5 leads on code generation and math under an MIT license, while GLM-5 is the best open-weight model for autonomously fixing real software bugs. For teams that need Apache 2.0 licensing, Qwen 3.5 leads on reasoning. If you want a cheap hosted API rather than self-hosting, DeepSeek V3.2 at $0.28/M is the best reference point. For teams that want to run a capable model on a single H100, GPT-oss 120B is the practical choice.


What Is an Open-Source LLM?

An open-source large language model is one whose weights are publicly available for download, so you can run it on your own hardware, fine-tune it on your own data, and deploy it without paying per-token API fees. The license determines what you can actually do with it commercially.

The most permissive licenses are MIT and Apache 2.0, which allow commercial use, fine-tuning, and redistribution (Apache 2.0 additionally requires attribution). The Llama License (Meta) and Gemma License (Google) permit most uses but carry specific restrictions.

License types in this guide:

| License | Commercial Use | Fine-Tuning | Redistribution | Restrictions |
|---|---|---|---|---|
| MIT | Yes | Yes | Yes | None |
| Apache 2.0 | Yes | Yes | Yes | Attribution required |
| Llama License | Yes (under 700M users) | Yes | Yes | Requires Meta approval above threshold |
| Gemma License | Yes | Yes | Yes | Prohibits uses that harm Google products |
| Open Weight | Varies | Varies | Varies | Check per model |
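As a quick illustration (not legal advice), the table can be encoded as a small lookup so a deployment script can flag any license that carries conditions beyond an unconditional "Yes / None". The table data is from above; the helper name is hypothetical:

```python
# Toy encoding of the license table above. Illustrative only, not legal
# advice; always read the actual license text for your deployment.
LICENSE_TABLE = {
    # license: (commercial use, restrictions)
    "MIT": ("Yes", "None"),
    "Apache 2.0": ("Yes", "Attribution required"),
    "Llama License": ("Yes (under 700M users)", "Requires Meta approval above threshold"),
    "Gemma License": ("Yes", "Prohibits uses that harm Google products"),
    "Open Weight": ("Varies", "Check per model"),
}

def has_conditions(license_name: str) -> bool:
    """True for any license that is not an unconditional 'Yes' with no restrictions."""
    commercial, restrictions = LICENSE_TABLE.get(license_name, ("Unknown", "Unknown"))
    return commercial != "Yes" or restrictions != "None"

print(has_conditions("MIT"))         # False
print(has_conditions("Apache 2.0"))  # True: attribution required
```

Unknown licenses fall through to "needs review" by default, which is the safe behavior for anything tagged "Open Weight" in the comparison below.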

Best Open Source LLMs 2026: Comparison Table

| Model | Provider | License | Params (Total/Active) | SWE-bench | GPQA Diamond | AIME 2025 | HumanEval | Arena Elo |
|---|---|---|---|---|---|---|---|---|
| Kimi K2.5 | Moonshot | MIT | 1T / 32B | 76.8% | 87.6% | 96.1% | 99.0% | 1,447 |
| GLM-5 | Zhipu AI | MIT | 744B / 40B | 77.8% | 86.0% | 84.0% | 90.0% | 1,451 |
| GLM-4.7 | Zhipu AI | MIT | 355B / 32B | 73.8% | 85.7% | 95.7% | 94.2% | 1,445 |
| Qwen 3.5 | Qwen | Apache 2.0 | 397B / 17B | 76.4% | 88.4% | N/A | N/A | N/A |
| MiMo-V2-Flash | Xiaomi | MIT | 309B / 15B | 73.4% | 83.7% | 94.1% | 84.8% | 1,401 |
| DeepSeek V3.2 | DeepSeek | Unlicensed | 685B / 37B | 67.8% | 79.9% | 89.3% | N/A | 1,421 |
| Qwen 3 235B | Qwen | Apache 2.0 | 235B / 22B | N/A | 81.1% | 92.3% | N/A | 1,422 |
| Step-3.5-Flash | Stepfun | Proprietary API | 196B / 11B | 74.4% | N/A | 97.3% | 81.1% | N/A |
| MiniMax M2.5 | MiniMax | Proprietary API | 230B / 10B | 80.2% | 85.2% | 86.3% | 89.6% | N/A |
| GPT-oss 120B | OpenAI | Apache 2.0 | 117B / 5.1B | 62.4% | 80.9% | 97.9% | 88.3% | 1,354 |

Source: Onyx Open LLM Leaderboard, last updated March 12, 2026.


Top Open Source Models at a Glance

Best at fixing real bugs: GLM-5, Kimi K2.5, Qwen 3.5, GLM-4.7, MiMo-V2-Flash

Best code generation: Kimi K2.5, GLM-4.7, GLM-5, GPT-oss 120B, MiMo-V2-Flash

Best reasoning: Qwen 3.5, Kimi K2.5, GLM-5, GLM-4.7, MiMo-V2-Flash

Best math: GPT-oss 120B, Kimi K2.5, GLM-4.7, MiMo-V2-Flash, Qwen 3 235B

Low-cost API options: DeepSeek V3.2 at $0.28/M, Kimi K2.5 free API, Step-3.5-Flash at $0.10/M
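To make the per-million-token prices above concrete, here is a back-of-envelope sketch. The workload numbers are hypothetical, and real bills also include output tokens, which are usually priced higher than input:

```python
# Back-of-envelope monthly input-token cost at the listed per-million prices.
# Ignores output tokens, caching discounts, and rate-tier differences.

def monthly_cost(requests_per_day: int, tokens_per_request: int,
                 price_per_million: float, days: int = 30) -> float:
    """Monthly input-token cost in dollars."""
    tokens = requests_per_day * tokens_per_request * days
    return tokens / 1_000_000 * price_per_million

# Hypothetical workload: 10,000 requests/day, 2,000 input tokens each.
for name, price in [("DeepSeek V3.2", 0.28), ("Step-3.5-Flash", 0.10)]:
    print(f"{name}: ${monthly_cost(10_000, 2_000, price):,.2f}/month")
```

At this workload (600M input tokens a month), the gap between the two cheapest listed APIs is about $168 versus $60 per month, small enough that capability differences will usually matter more than price.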


Top Open Source Models: Detailed Reviews

1. Kimi K2.5 (Moonshot, MIT)

Facts: Kimi K2.5 is listed as MIT-licensed with 1T total / 32B active parameters, 76.8% SWE-bench, 87.6% GPQA Diamond, 96.1% AIME 2025, 99.0% HumanEval, and 1,447 Arena Elo.

Recommendation: Choose Kimi K2.5 when you want the strongest combination of code generation, math performance, and permissive licensing in one model.

2. GLM-5 (Zhipu AI, MIT)

Facts: GLM-5 is listed as MIT-licensed with 744B total / 40B active parameters, 77.8% SWE-bench, 86.0% GPQA Diamond, 84.0% AIME 2025, 90.0% HumanEval, and 1,451 Arena Elo.

Recommendation: Pick GLM-5 if your top priority is open-weight software-engineering performance rather than the best HumanEval or math score.

3. GLM-4.7 (Zhipu AI, MIT)

Facts: GLM-4.7 is listed as MIT-licensed with 355B total / 32B active parameters, 73.8% SWE-bench, 85.7% GPQA Diamond, 95.7% AIME 2025, 94.2% HumanEval, and 1,445 Arena Elo.

Recommendation: GLM-4.7 is the better fit than GLM-5 when you still want strong open-weight coding performance but prefer a smaller, easier-to-host model (355B total parameters versus GLM-5's 744B).

4. Qwen 3.5 (Alibaba, Apache 2.0)

Facts: Qwen 3.5 is listed as Apache 2.0 licensed with 397B total / 17B active parameters, 76.4% SWE-bench, and 88.4% GPQA Diamond. HumanEval, AIME, and Arena Elo are not listed in this snapshot.

Recommendation: Use Qwen 3.5 when you need an Apache-licensed model with especially strong reasoning performance.

5. MiMo-V2-Flash (Xiaomi, MIT)

Facts: MiMo-V2-Flash is listed as MIT-licensed with 309B total / 15B active parameters, 73.4% SWE-bench, 83.7% GPQA Diamond, 94.1% AIME 2025, 84.8% HumanEval, and 1,401 Arena Elo.

Recommendation: MiMo-V2-Flash is a sensible option when you want solid math and coding performance under MIT without reaching for the largest models in this category.

6. DeepSeek V3.2 (DeepSeek)

Facts: DeepSeek V3.2 is listed with public weights, no standard open-source license, 685B total / 37B active parameters, 67.8% SWE-bench, 79.9% GPQA Diamond, 89.3% AIME 2025, 1,421 Arena Elo, and $0.28/M input pricing.

Recommendation: DeepSeek V3.2 is a cost-driven choice for teams comfortable reviewing non-standard licensing terms before production use.

7. Qwen 3 235B (Alibaba, Apache 2.0)

Facts: Qwen 3 235B is listed as Apache 2.0 licensed with 235B total / 22B active parameters, 81.1% GPQA Diamond, 92.3% AIME 2025, and 1,422 Arena Elo. SWE-bench and HumanEval are not listed in this snapshot.

Recommendation: Choose Qwen 3 235B when you want Apache licensing and care more about reasoning benchmarks than coding-specific ones.

8. Step-3.5-Flash (Stepfun)

Facts: Step-3.5-Flash is listed as a proprietary API model with 196B total / 11B active parameters, 74.4% SWE-bench, 97.3% AIME 2025, 81.1% HumanEval, and $0.10/M input pricing.

Recommendation: Step-3.5-Flash is useful as a budget benchmark in this comparison, but it is not the right pick if self-hosting or open-weight access is your actual requirement.

9. MiniMax M2.5 (MiniMax)

Facts: MiniMax M2.5 is listed as a proprietary API model with 230B total / 10B active parameters, 80.2% SWE-bench, 85.2% GPQA Diamond, 86.3% AIME 2025, 89.6% HumanEval, and $0.30/M input pricing.

Recommendation: MiniMax M2.5 is relevant here mainly as a price-performance reference point for teams deciding whether open-weight deployment is worth the tradeoff.

10. GPT-oss 120B (OpenAI, Apache 2.0)

Facts: GPT-oss 120B is listed as Apache 2.0 licensed with 117B total / 5.1B active parameters, 62.4% SWE-bench, 80.9% GPQA Diamond, 97.9% AIME 2025, 88.3% HumanEval, and 1,354 Arena Elo.

Recommendation: GPT-oss 120B is the strongest option in this list when single-node self-hosting and Apache licensing matter more than reaching the top of SWE-bench.


Open Source LLMs: Best for Each Use Case

| Use Case | Best Model | License | Key Score |
|---|---|---|---|
| Best coding | Kimi K2.5 | MIT | 99.0% HumanEval, 76.8% SWE-bench |
| Best reasoning | Qwen 3.5 | Apache 2.0 | 88.4% GPQA Diamond |
| Best math | GPT-oss 120B | Apache 2.0 | 97.9% AIME 2025 |
| Best at fixing real bugs | GLM-5 | MIT | 77.8% SWE-bench |
| Cheapest API with frontier scores | DeepSeek V3.2 | Unlicensed | $0.28/M input |
| Best single-H100 deployment | GPT-oss 120B | Apache 2.0 | 62.4% SWE-bench, 97.9% AIME |
| Best algorithmic tasks at low cost | Step-3.5-Flash | API | $0.10/M, 74.4% SWE-bench |
| Best for fine-tuning (unrestricted) | Kimi K2.5 | MIT | Fully open weights |

Using Open-Source LLMs in Enterprise Workflows

Choosing an MIT or Apache 2.0 model is only part of the decision. Teams also need a way to compare hosted and self-hosted backends, connect those models to internal data, and preserve permissions.

Onyx is useful in this context because it gives teams a common application layer on top of open-source models. You can test a hosted API against a self-hosted endpoint, connect the chosen model to sources like Slack, Confluence, Jira, Google Drive, and GitHub, and keep permission-aware retrieval in front of users. That makes it easier to act on the licensing and deployment tradeoffs in this guide instead of evaluating each model in isolation.


Frequently Asked Questions

What is the best open-source LLM in 2026?

In this leaderboard snapshot, there is no single winner across every benchmark. Kimi K2.5 leads HumanEval (99.0%), GLM-5 has the strongest SWE-bench result among open models (77.8%), Qwen 3.5 has the highest GPQA Diamond score among the open-source models listed (88.4%), and GPT-oss 120B tops AIME 2025 (97.9%). The best choice depends on whether you care most about coding, reasoning, licensing, or deployment constraints.

What is the difference between open-source and open-weight LLMs?

Open-source LLMs have publicly available weights, architecture, and (ideally) training code under a permissive license like MIT or Apache 2.0. Open-weight models release weights but may have proprietary training code or restrictive license terms. In practice, most "open-source" LLMs are open-weight: you can download and run them, but full source code and training data are rarely published.

Which open-source LLMs can I use commercially?

MIT and Apache 2.0 licensed models allow commercial use without restriction: Kimi K2.5, GLM-4.7, GLM-5, MiMo-V2-Flash, GPT-oss 120B, Qwen 3.5, Qwen 3 235B. DeepSeek V3.2 has a non-standard license that requires review for commercial deployments. Step-3.5-Flash and MiniMax M2.5 are proprietary API models. Always check the specific license terms for your deployment scenario.

Can I run these open-source models locally?

Most models in this list require enterprise hardware (4x H100 80GB or equivalent) for full-precision inference. More accessible options include GPT-oss 120B (1x H100) and the DeepSeek R1 distilled variants (DS-R1-Distill-Qwen-32B, DS-R1-Distill-Llama-70B) that run on a single RTX 4090 or H100. See the Best Self-Hosted LLMs 2026 guide for hardware requirements per model.
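A rough way to sanity-check these hardware claims: the weights alone need about total-parameters × bytes-per-parameter, ignoring the KV cache and activations, which add meaningful overhead in practice. A minimal sketch:

```python
# Rough VRAM needed just for model weights (ignores KV cache and
# activations, which add meaningful overhead in practice).

def weight_vram_gb(total_params_billion: float, bits_per_param: int) -> float:
    """Gigabytes of memory for the weights alone."""
    return total_params_billion * 1e9 * bits_per_param / 8 / 1e9

# GPT-oss 120B (117B total parameters) at different precisions:
for bits, label in [(16, "FP16/BF16"), (8, "INT8"), (4, "4-bit")]:
    print(f"{label}: ~{weight_vram_gb(117, bits):.1f} GB")
```

At 4-bit precision, 117B parameters is about 58.5 GB of weights, which is why a single 80 GB H100 is plausible for GPT-oss 120B; at FP16 the same weights need roughly 234 GB, and the sparse mixture-of-experts routing (5.1B active parameters) reduces compute per token, not the memory needed to hold the weights.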

What is the best platform for running open-source and proprietary LLMs together?

Most teams end up mixing models: a self-hosted open-weight model for sensitive data, a cheap API for high-volume tasks, and a frontier model for the hardest work. Onyx gives teams a single interface to connect all of these, routing tasks to the right model while keeping answers grounded in company knowledge from Slack, Confluence, Jira, Google Drive, and GitHub. It's MIT-licensed, supports self-hosted and API-based backends, and is free to get started.