By Roshan Desai
Self-hosting an LLM used to mean accepting significantly worse performance than the proprietary APIs. In 2026, that has changed. The best self-hosted models now track the proprietary APIs closely on coding benchmarks, but they demand serious hardware. The question is no longer just capability: it is whether the economics make sense for your team, and which model fits your GPU setup.
This guide covers the top self-hosted LLMs from the Onyx Self-Hosted LLM Leaderboard, organized by the hardware tier they require. Data is current as of March 12, 2026.
How this guide is sourced: Hardware, benchmark, and license data comes from the Onyx Self-Hosted LLM Leaderboard. The recommendations in each section interpret that data for specific deployment tiers.
TL;DR: The right self-hosted model depends almost entirely on what GPU you have. GLM-4.7 is the entry point for cluster deployments at 4x H100 (320GB), Kimi K2.5 needs 4x H200 (564GB) but leads on code generation, and GLM-5 needs 4x H200 (564GB) and posts the strongest bug-fixing performance in its tier. A single H100 gets you Qwen3.5-122B-A10B, a 122B MoE model that runs efficiently via only 10B active parameters, or DeepSeek V3 with aggressive quantization. On an RTX 4090, Qwen3.5-27B is the general-purpose pick and DS-R1-Distill-Qwen-32B is the reasoning pick. For budget GPUs, Qwen3.5-9B runs on an RTX 3090 and Qwen3.5-4B runs on an RTX 3060 or smaller.
A self-hosted LLM is a model you run on hardware you control rather than calling via an API. Your data stays on your infrastructure, you don't pay per token, and you can customize the model for your specific use case.
Reasons teams choose to self-host:

- Data control: prompts and documents never leave your infrastructure.
- Cost: no per-token billing; spend is bounded by hardware, power, and operations.
- Customization: you can quantize, fine-tune, or swap models to fit your use case.

The hardware you can dedicate determines which models are realistic:
| Tier | Hardware | Total VRAM | Approximate Cost | Best For |
|---|---|---|---|---|
| 4x H200 cluster | 4x H200 141GB | 564GB | $150K-$250K | Large MoE models up to ~750B |
| 4x H100 cluster | 4x H100 80GB | 320GB | $60K-$100K | Large MoE models |
| Dual H100 | 2x H100 80GB | 160GB | $30K-$50K | 200B+ MoE, 120B dense |
| Single H100 | 1x H100 80GB | 80GB | $15K-$25K | 70B-120B models |
| RTX 4090 | 1x RTX 4090 24GB | 24GB | $1,500-$2,000 | 27B-32B models in INT4 |
| RTX 3090 | 1x RTX 3090 24GB | 24GB | $400-$700 | 7B-14B models |
| RTX 3060 | 1x RTX 3060 12GB | 12GB | $250-$400 | 4B-7B models |
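The VRAM figures in this table follow from simple arithmetic: model weights need roughly total parameters times bytes per parameter, plus headroom for the KV cache and activations. Note that MoE models still need VRAM for all experts, so the total (not active) parameter count drives memory. A rough sketch; the 20% overhead factor is an assumption, and real usage varies with context length and batch size:

```python
def estimate_vram_gb(total_params_b: float, bits_per_param: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB for a model with total_params_b billion
    parameters. Weights take bits_per_param/8 bytes each; the overhead
    factor (an assumed ~20%) covers KV cache and activations."""
    weight_gb = total_params_b * bits_per_param / 8  # billions of params -> GB
    return weight_gb * overhead

# A 27B dense model in INT4 (4 bits per weight):
print(round(estimate_vram_gb(27, 4), 1))   # 16.2 GB: fits a 24GB RTX 4090
# The same model unquantized in FP16 (16 bits per weight):
print(round(estimate_vram_gb(27, 16), 1))  # 64.8 GB: needs an H100
```

This is why the 24GB RTX 4090 tier tops out around 27B-32B dense models in INT4, and why a 122B MoE model needs a full H100 even though only 10B parameters are active per token.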
- 4x H200 (564GB): Kimi K2.5, GLM-5
- 4x H100 (320GB): GLM-4.7
- Dual GPU (2x H100 / 2x A100): MiniMax M2.5, Qwen3-235B-A22B, Step-3.5-Flash
- Single H100: Qwen3.5-122B-A10B, DeepSeek V3, GPT-oss 120B
- RTX 4090: Qwen3.5-27B, DS-R1-Distill-Qwen-32B
- RTX 3090: Qwen3.5-9B
- RTX 3060 and smaller: Qwen3.5-4B
These models require multi-GPU inference but deliver performance approaching the proprietary frontier models. Hardware requirements vary within this tier: GLM-4.7 fits on 4x H100 80GB (320GB total), while Kimi K2.5 and GLM-5 need 4x H200 141GB (564GB total). Recommended for enterprise teams that already have GPU clusters or are willing to invest in the hardware to avoid API costs long-term.
| Model | License | Params (Total/Active) | SWE-bench | HumanEval | GPQA | AIME |
|---|---|---|---|---|---|---|
| GLM-5 | MIT | 744B / 40B | 77.8% | 90.0% | 86.0% | 84.0% |
| Kimi K2.5 | MIT | 1T / 32B | 76.8% | 99.0% | 87.6% | 96.1% |
| GLM-4.7 | MIT | 355B / 32B | 73.8% | 94.2% | 85.7% | 95.7% |
GLM-4.7 (4x H100 80GB, 320GB total): MIT-licensed, 355B / 32B active parameters. The most accessible server-tier model — if you already have H100 infrastructure, this is the practical starting point with the best balance of capability and deployability.
Kimi K2.5 (4x H200 141GB, 564GB total): MIT-licensed, 1T / 32B active parameters. Leads this entire list on code generation. Choose it when your cluster can support the H200 requirement and code quality is the top priority.
GLM-5 (4x H200 141GB, 564GB total): MIT-licensed, 744B / 40B active parameters. Leads on SWE-bench — choose it when software-engineering performance matters more than hardware efficiency.
| Model | License | Params | SWE-bench | HumanEval | GPQA | AIME |
|---|---|---|---|---|---|---|
| MiniMax M2.5 | Apache 2.0 | 230B / 10B | 80.2% | 89.6% | 85.2% | 86.3% |
| Qwen3-235B-A22B | Apache 2.0 | 235B / 22B | N/A | N/A | 71.1% | 81.5% |
| Step-3.5-Flash | Apache 2.0 | 196B / 11B | 74.4% | 81.1% | N/A | 99.8% |
| Devstral-2-123B | Modified MIT | 123B / 123B | 72.2% | N/A | N/A | N/A |
| Qwen3-Coder-Next | Apache 2.0 | 80B / 3B | 70.6% | 94.1% | 53.4% | 89.2% |
MiniMax M2.5: Apache 2.0, 230B / 10B active parameters. At 80.2% SWE-bench it matches the top proprietary models on bug-fixing, with MoE keeping inference efficient. The strongest all-round pick at this tier.
Qwen3-235B-A22B: Apache 2.0, 235B / 22B active parameters. A reasoning-focused model, particularly strong on science and math. A good fit when you need a large capable model without evaluating proprietary licensing.
Step-3.5-Flash: Apache 2.0, 196B / 11B active parameters, 74.4% SWE-bench and 99.8% AIME. The standout choice when your workload is heavy on numerical reasoning, financial analysis, or algorithmic tasks.
Devstral-2-123B: Modified MIT, 123B dense parameters, 72.2% SWE-bench. A coding-specialist option for teams that want a dense model optimized specifically for software engineering.
Qwen3-Coder-Next: Apache 2.0, 80B / 3B active parameters, 70.6% SWE-bench and 94.1% HumanEval. A coding-focused MoE whose 3B active parameters make it the efficiency pick when you want strong code generation at minimal inference cost.
Single H100 deployments are the most common enterprise self-hosting configuration. At $15K-$25K for a used or leased H100, these models offer strong cost-performance for teams running steady inference workloads.
| Model | License | Params | SWE-bench | HumanEval | GPQA | AIME |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B | Apache 2.0 | 122B / 10B | N/A | N/A | N/A | N/A |
| DeepSeek V3 | Non-standard | 671B / 37B | 38.8% | N/A | 68.4% | N/A |
| GPT-oss 120B | Apache 2.0 | 117B / 5.1B | 62.4% | 88.3% | 80.9% | 97.9% |
| DS-R1-Distill-Llama-70B | MIT | 70B / 70B | N/A | 86.0% | 65.2% | 70.0% |
Qwen3.5-122B-A10B: Apache 2.0, 122B total / 10B active parameters via MoE. The top pick for single H100 deployments — you get a 122B-parameter model's knowledge at the inference cost of a 10B model.
DeepSeek V3: Public weights, non-standard license, 671B / 37B active parameters. Fits on a single H100 with aggressive INT4 quantization. A strong option if you are comfortable with the setup requirements — review the license terms before production deployment.
GPT-oss 120B: Apache 2.0, 117B / 5.1B active parameters, 62.4% SWE-bench, 88.3% HumanEval, 80.9% GPQA, and 97.9% AIME. A reliable fallback with strong benchmark data for teams that want proven numbers before committing to newer models.
DS-R1-Distill-Llama-70B: MIT, 70B dense parameters, 86.0% HumanEval, 65.2% GPQA, and 70.0% AIME. Distilled from DeepSeek R1, it is the reasoning-focused pick at this tier.
The RTX 4090 has become the benchmark consumer GPU for LLM enthusiasts and small teams. At 24GB VRAM, it runs 27B-32B dense models in INT4 quantization comfortably.
| Model | License | Params | Key Strength |
|---|---|---|---|
| Qwen3.5-27B | Apache 2.0 | 27B | Best general-purpose |
| DS-R1-Distill-Qwen-32B | MIT | 32B | Strong reasoning |
Qwen3.5-27B: Apache 2.0, 27B parameters, runs on a single RTX 4090 in INT4. The default pick for RTX 4090 owners — latest generation Qwen with broad capability across coding, reasoning, and everyday tasks.
DS-R1-Distill-Qwen-32B: MIT, 32B parameters, 85.4% HumanEval, 62.1% GPQA, and 72.0% AIME. Distilled from DeepSeek R1, it punches above its weight on math and logic tasks — the better choice when reasoning depth is the priority.
The most accessible tier for individual developers and small teams. The Qwen3.5 small model family is best-in-class here, offering strong coding and reasoning performance on consumer hardware that costs $400-$700.
| Model | License | Params | VRAM (INT4) | Runs on |
|---|---|---|---|---|
| Qwen3.5-9B | Apache 2.0 | 9B | 5GB | RTX 3090+ |
| Qwen3.5-4B | Apache 2.0 | 4B | 2GB | RTX 3060+ |
Qwen3.5-9B: Apache 2.0, 9B parameters, only 5GB VRAM at INT4. The best daily-driver for RTX 3090 owners — capable across coding assistance, document Q&A, and reasoning tasks.
Qwen3.5-4B: Apache 2.0, 4B parameters, 2GB VRAM at INT4. The right pick when hardware is the hard constraint — runs on an RTX 3060 or smaller and delivers surprisingly capable performance for its size.
| Your Situation | Best Model | License | Hardware |
|---|---|---|---|
| Frontier performance, enterprise cluster | Kimi K2.5 | MIT | 4x H200 141GB |
| Best SWE-bench on available H100 cluster | GLM-4.7 | MIT | 4x H100 80GB |
| Best SWE-bench, 4x H200 cluster | GLM-5 | MIT | 4x H200 141GB |
| Best bug-fixing at dual-GPU | MiniMax M2.5 | Apache 2.0 | 2x H100 80GB |
| Best math at dual-GPU | Step-3.5-Flash | Apache 2.0 | 2x H100 80GB |
| Best single H100 model | Qwen3.5-122B-A10B | Apache 2.0 | 1x H100 80GB |
| Best reasoning on single H100 | DS-R1-Distill-Llama-70B | MIT | 1x H100 80GB |
| Best RTX 4090 general model | Qwen3.5-27B | Apache 2.0 | 1x RTX 4090 |
| Best RTX 4090 reasoning | DS-R1-Distill-Qwen-32B | MIT | 1x RTX 4090 |
| RTX 3090 / budget GPU | Qwen3.5-9B | Apache 2.0 | 1x RTX 3090 |
| Minimal hardware / RTX 3060 | Qwen3.5-4B | Apache 2.0 | 1x RTX 3060 |
Self-hosting solves the inference problem, not the retrieval and permissions problem. Most teams still need a way to connect the model to internal knowledge sources and keep access controls intact.
Onyx is relevant here as the application layer around a self-hosted model. You can point Onyx at a vLLM or Ollama deployment, connect sources like Slack, Confluence, Jira, and GitHub, and keep permission-aware search and chat in front of users. That makes the hardware choices in this guide more actionable: the model handles inference, while Onyx handles retrieval, orchestration, and team-facing access.
What is the best self-hosted LLM in 2026?
In this leaderboard snapshot, the best self-hosted LLM depends mostly on hardware tier. On enterprise GPU clusters, Kimi K2.5 leads HumanEval, GLM-4.7 is the practical 4x H100 option, and GLM-5 posts the strongest SWE-bench result among the 4x H200 models; at dual-GPU scale, MiniMax M2.5 has the highest SWE-bench score in this guide at 80.2%. On a single H100, Qwen3.5-122B-A10B is the top pick, with DeepSeek V3 as an alternative for teams comfortable with heavy quantization. On an RTX 4090, Qwen3.5-27B is the general-purpose pick and DS-R1-Distill-Qwen-32B is the reasoning-focused option. For budget hardware, Qwen3.5-9B runs on an RTX 3090 and Qwen3.5-4B runs on an RTX 3060 or smaller.
Can I run a frontier-class LLM on a single GPU?
Yes. GPT-oss 120B achieves 62.4% SWE-bench on a single H100 80GB. On an RTX 4090, DS-R1-Distill-Qwen-32B delivers 72% AIME and 62.1% GPQA Diamond, capabilities that would have required a multi-GPU cluster two years ago. The DeepSeek R1 distilled models are among the most significant recent advances in single-GPU performance.
What tools do I use to run these models locally?
The most common inference runtimes are Ollama (easiest setup for consumer hardware), vLLM (best throughput for production deployments), LM Studio (GUI-based for Mac/Windows), and llama.cpp (CPU and low-VRAM support). Most models in this guide are available through Ollama or HuggingFace directly.
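One practical consequence of this runtime landscape: vLLM serves an OpenAI-compatible HTTP API, and Ollama offers a compatible endpoint as well, so application code can stay the same whichever runtime you pick. A minimal sketch using only the standard library; the host, port, model name, and the `build_chat_request` helper are illustrative assumptions, not part of either tool:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Assemble a POST to an OpenAI-compatible /v1/chat/completions endpoint.
    (Hypothetical helper; base_url/model depend on your deployment.)"""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8000", "qwen3.5-27b", "Explain MoE routing.")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# With a server running, you would send it with:
#   json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"]
```

Because the wire format matches the OpenAI API, swapping a self-hosted model in behind existing tooling is usually a one-line base-URL change.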
Which self-hosted LLMs support air-gapped deployment?
All models in this guide can run in fully air-gapped environments after the initial model weight download. MIT and Apache 2.0 licensed models have no phone-home requirements or usage telemetry by default. For regulated environments (ITAR, HIPAA, FedRAMP), open-weight models on private infrastructure provide a practical path to meeting data-residency requirements.
How do I estimate self-hosting cost versus API costs?
A single H100 80GB server node runs approximately $2-4/hour on cloud or $15K-$25K to purchase. GPT-oss 120B on that node can serve roughly 1-5 million tokens per day depending on workload. How quickly you break even depends on volume and on the API you are replacing: at budget prices like DeepSeek's $0.28/M input tokens, the API is hard to beat, but against premium APIs charging several dollars per million tokens, a heavily utilized node can pay for itself in roughly two years. For teams with consistent 24/7 inference demand, self-hosting is typically the cheaper option over a multi-year horizon. See Best Open Source LLMs 2026 for a full comparison of open-source model licenses, or Best LLMs for Coding 2026 for coding-focused rankings including API options.
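The break-even arithmetic can be sketched directly. The $20K hardware cost and $5/M blended API price below are hypothetical inputs for illustration, not quoted figures:

```python
def breakeven_days(hardware_cost: float, tokens_per_day_m: float,
                   api_price_per_m: float) -> float:
    """Days until cumulative API spend at the same volume matches the
    hardware outlay. Ignores power and ops costs, so the true break-even
    comes somewhat later than this estimate."""
    daily_api_cost = tokens_per_day_m * api_price_per_m  # dollars per day
    return hardware_cost / daily_api_cost

# $20K node serving 5M tokens/day, versus a hypothetical $5/M blended API price:
print(round(breakeven_days(20_000, 5.0, 5.00)))   # 800 days
# Doubling the volume halves the break-even:
print(round(breakeven_days(20_000, 10.0, 5.00)))  # 400 days
```

The sensitivity to `api_price_per_m` is the key takeaway: against cheap APIs the break-even stretches out by years, so the case for self-hosting rests on volume, the price of the API being displaced, and non-cost factors like data control.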
What is the best model for my specific hardware?
It depends on your GPU's VRAM and whether you are using FP16 or INT4 quantization. The Onyx LLM Hardware Requirements Calculator lets you enter your hardware and instantly see which models fit, including VRAM requirements and recommended quantization settings for each model in this guide.