Self-Hosted AI · 12 min read · Published Mar 8, 2026

Best Self-Hosted LLMs in 2026

By Roshan Desai

Self-hosting an LLM used to mean accepting significantly worse performance than the proprietary APIs. In 2026, that's changed. The best self-hosted models now come within striking distance of proprietary APIs on coding benchmarks, but they require serious hardware to run. The question isn't just capability anymore. It's whether the economics make sense for your team, and which model fits your GPU setup.

This guide covers the top self-hosted LLMs from the Onyx Self-Hosted LLM Leaderboard, organized by the hardware tier they require. Data is updated as of March 12, 2026.

How this guide is sourced: Hardware, benchmark, and license data comes from the Onyx Self-Hosted LLM Leaderboard. The recommendations in each section interpret that data for specific deployment tiers.


TL;DR: The right self-hosted model depends almost entirely on what GPU you have. GLM-4.7 is the entry point for cluster deployments at 4x H100 (320GB); Kimi K2.5 needs 4x H200 (564GB) but leads on code generation; and GLM-5 needs 4x H200 (564GB) for the strongest bug-fixing performance at the cluster tier (MiniMax M2.5 posts the guide's highest SWE-bench score on a dual-GPU setup). A single H100 gets you Qwen3.5-122B-A10B, a 122B MoE model that runs efficiently via only 10B active parameters, or DeepSeek V3 with aggressive quantization. On an RTX 4090, Qwen3.5-27B is the general-purpose pick and DS-R1-Distill-Qwen-32B is the reasoning pick. For budget GPUs, Qwen3.5-9B runs on an RTX 3090 and Qwen3.5-4B runs on an RTX 3060 or smaller.


What Is a Self-Hosted LLM?

A self-hosted LLM is a model you run on hardware you control rather than calling via an API. Your data stays on your infrastructure, you don't pay per token, and you can customize the model for your specific use case.

Reasons teams choose to self-host:

  • Data privacy: Sensitive documents, code, and queries never leave your infrastructure
  • Cost at scale: At high query volumes, self-hosting is cheaper than API pricing
  • Air-gapped environments: Defense, healthcare, and regulated industries requiring offline deployment
  • Model customization: Fine-tuning and LoRA adapters for domain-specific tasks
  • Latency control: Predictable inference performance without rate limits

Hardware Tiers

| Tier | Hardware | Total VRAM | Approximate Cost | Best For |
| --- | --- | --- | --- | --- |
| 4x H200 cluster | 4x H200 141GB | 564GB | $150K-$250K | Large MoE models up to ~1T |
| 4x H100 cluster | 4x H100 80GB | 320GB | $60K-$100K | Large MoE models |
| Dual H100 | 2x H100 80GB | 160GB | $30K-$50K | 200B+ MoE, 120B dense |
| Single H100 | 1x H100 80GB | 80GB | $15K-$25K | 70B-120B models |
| RTX 4090 | 1x RTX 4090 24GB | 24GB | $1,500-$2,000 | 27B-32B models in INT4 |
| RTX 3090 | 1x RTX 3090 24GB | 24GB | $400-$700 | 7B-14B models |
| RTX 3060 | 1x RTX 3060 12GB | 12GB | $250-$400 | 4B-7B models |
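A rough rule of thumb for reading this table (our approximation, not leaderboard data): weight memory is roughly parameter count times bytes per weight, plus around 20% headroom for the KV cache and activations. A sketch:

```python
def fits(params_b: float, vram_gb: float, bits: int = 4, overhead: float = 1.2) -> bool:
    """Rule-of-thumb check: weights at `bits` precision plus ~20% headroom
    for KV cache and activations must fit in VRAM. Note that MoE models
    still need their *total* parameters resident, not just the active subset."""
    needed_gb = params_b * (bits / 8) * overhead
    return needed_gb <= vram_gb

print(fits(32, 24))    # 32B dense at INT4 ≈ 19.2GB → True on an RTX 4090
print(fits(122, 80))   # 122B MoE at INT4 ≈ 73.2GB → True on one H100
print(fits(122, 24))   # → False
```

Real footprints vary with context length, batch size, and runtime; treat this as a first-pass filter, not a guarantee.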

Self-Hosted Models at a Glance

4x H200 (564GB): Kimi K2.5, GLM-5

4x H100 (320GB): GLM-4.7

Single H100: Qwen3.5-122B-A10B, DeepSeek V3, GPT-oss 120B, DS-R1-Distill-Llama-70B

Dual GPU (2x H100 / 2x A100): MiniMax M2.5, Qwen3-235B-A22B, Step-3.5-Flash

RTX 4090: Qwen3.5-27B, DS-R1-Distill-Qwen-32B

RTX 3090: Qwen3.5-9B

RTX 3060 and smaller: Qwen3.5-4B


Best Self-Hosted LLMs by Hardware Tier

Server Cluster Tier (4x H100 to 4x H200)

These models require multi-GPU inference but deliver performance that matches proprietary frontier models. Hardware requirements vary within this tier: GLM-4.7 fits on 4x H100 80GB (320GB total), while Kimi K2.5 and GLM-5 need 4x H200 141GB (564GB total). Recommended for enterprise teams that already have GPU clusters or are willing to invest in the hardware to avoid API costs long-term.

| Model | License | Params (Total / Active) | SWE-bench | HumanEval | GPQA | AIME |
| --- | --- | --- | --- | --- | --- | --- |
| GLM-5 | MIT | 744B / 40B | 77.8% | 90.0% | 86.0% | 84.0% |
| Kimi K2.5 | MIT | 1T / 32B | 76.8% | 99.0% | 87.6% | 96.1% |
| GLM-4.7 | MIT | 355B / 32B | 73.8% | 94.2% | 85.7% | 95.7% |

GLM-4.7 (4x H100 80GB, 320GB total): MIT-licensed, 355B / 32B active parameters. The most accessible server-tier model — if you already have H100 infrastructure, this is the practical starting point with the best balance of capability and deployability.

Kimi K2.5 (4x H200 141GB, 564GB total): MIT-licensed, 1T / 32B active parameters. Leads this entire list on code generation. Choose it when your cluster can support the H200 requirement and code quality is the top priority.

GLM-5 (4x H200 141GB, 564GB total): MIT-licensed, 744B / 40B active parameters. Leads on SWE-bench — choose it when software-engineering performance matters more than hardware efficiency.


Dual GPU Tier (2x H100 / 2x A100)

| Model | License | Params (Total / Active) | SWE-bench | HumanEval | GPQA | AIME |
| --- | --- | --- | --- | --- | --- | --- |
| MiniMax M2.5 | Apache 2.0 | 230B / 10B | 80.2% | 89.6% | 85.2% | 86.3% |
| Qwen3-235B-A22B | Apache 2.0 | 235B / 22B | N/A | N/A | 71.1% | 81.5% |
| Step-3.5-Flash | Apache 2.0 | 196B / 11B | 74.4% | 81.1% | N/A | 99.8% |
| Devstral-2-123B | Modified MIT | 123B / 123B | 72.2% | N/A | N/A | N/A |
| Qwen3-Coder-Next | Apache 2.0 | 80B / 3B | 70.6% | 94.1% | 53.4% | 89.2% |

MiniMax M2.5: Apache 2.0, 230B / 10B active parameters. At 80.2% SWE-bench it matches the top proprietary models on bug-fixing, with MoE keeping inference efficient. The strongest all-round pick at this tier.

Qwen3-235B-A22B: Apache 2.0, 235B / 22B active parameters. The best reasoning-focused model here, particularly strong on science and math. A good fit when you need a large capable model without evaluating proprietary licensing.

Step-3.5-Flash: Apache 2.0, 196B / 11B active parameters, 74.4% SWE-bench and 99.8% AIME. The standout choice when your workload is heavy on numerical reasoning, financial analysis, or algorithmic tasks.

Devstral-2-123B: Modified MIT, 123B dense parameters, 72.2% SWE-bench. A coding-specialist option for teams that want a dense model optimized specifically for software engineering.


Single H100 Tier (1x H100 80GB)

Single H100 deployments are the most common enterprise self-hosting configuration. At $15K-$25K for a used/lease H100, these models represent strong cost-performance for teams running inference workloads.

| Model | License | Params (Total / Active) | SWE-bench | HumanEval | GPQA | AIME |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-122B-A10B | Apache 2.0 | 122B / 10B | N/A | N/A | N/A | N/A |
| DeepSeek V3 | Custom | 671B / 37B | 38.8% | N/A | 68.4% | N/A |
| GPT-oss 120B | Apache 2.0 | 117B / 5.1B | 62.4% | 88.3% | 80.9% | 97.9% |
| DS-R1-Distill-Llama-70B | MIT | 70B / 70B | N/A | 86.0% | 65.2% | 70.0% |

Qwen3.5-122B-A10B: Apache 2.0, 122B total / 10B active parameters via MoE. The top pick for single H100 deployments — you get a 122B-parameter model's knowledge at the inference cost of a 10B model.

DeepSeek V3: Public weights, non-standard license, 671B / 37B active parameters. At 671B total parameters, even INT4 weights exceed a single H100's 80GB, so single-GPU deployments depend on aggressive quantization combined with offloading part of the weights to system memory, at a real throughput cost. A strong option if you are comfortable with the setup requirements, and review the license terms before production deployment.

GPT-oss 120B: Apache 2.0, 117B / 5.1B active parameters, 62.4% SWE-bench, 88.3% HumanEval, 80.9% GPQA, and 97.9% AIME. A reliable fallback with strong benchmark data for teams that want proven numbers before committing to newer models.


RTX 4090 Tier (24GB VRAM)

The RTX 4090 has become the benchmark consumer GPU for LLM enthusiasts and small teams. At 24GB VRAM, it runs 27B-32B dense models in INT4 quantization comfortably.

| Model | License | Params | Key Strength |
| --- | --- | --- | --- |
| Qwen3.5-27B | Apache 2.0 | 27B | Best general-purpose |
| DS-R1-Distill-Qwen-32B | MIT | 32B | Strong reasoning |

Qwen3.5-27B: Apache 2.0, 27B parameters, runs on a single RTX 4090 in INT4. The default pick for RTX 4090 owners — latest generation Qwen with broad capability across coding, reasoning, and everyday tasks.

DS-R1-Distill-Qwen-32B: MIT, 32B parameters, 85.4% HumanEval, 62.1% GPQA, and 72.0% AIME. Distilled from DeepSeek R1, it punches above its weight on math and logic tasks — the better choice when reasoning depth is the priority.


Budget Tier (RTX 3090 / RTX 3060 and smaller)

The most accessible tier for individual developers and small teams. The Qwen3.5 small model family is best-in-class here, offering strong coding and reasoning performance on consumer hardware in the $250-$700 range.

| Model | License | Params | VRAM (INT4) | Runs on |
| --- | --- | --- | --- | --- |
| Qwen3.5-9B | Apache 2.0 | 9B | 5GB | RTX 3090+ |
| Qwen3.5-4B | Apache 2.0 | 4B | 2GB | RTX 3060+ |

Qwen3.5-9B: Apache 2.0, 9B parameters, only 5GB VRAM at INT4. The best daily-driver for RTX 3090 owners — capable across coding assistance, document Q&A, and reasoning tasks.

Qwen3.5-4B: Apache 2.0, 4B parameters, 2GB VRAM at INT4. The right pick when hardware is the hard constraint — runs on an RTX 3060 or smaller and delivers surprisingly capable performance for its size.


Choosing the Right Self-Hosted Model

| Your Situation | Best Model | License | Hardware |
| --- | --- | --- | --- |
| Frontier performance, enterprise cluster | Kimi K2.5 | MIT | 4x H200 141GB |
| Best option on a 4x H100 cluster | GLM-4.7 | MIT | 4x H100 80GB |
| Best SWE-bench on a 4x H200 cluster | GLM-5 | MIT | 4x H200 141GB |
| Best bug-fixing at dual-GPU | MiniMax M2.5 | Apache 2.0 | 2x H100 80GB |
| Best math at dual-GPU | Step-3.5-Flash | Apache 2.0 | 2x H100 80GB |
| Best single H100 model | Qwen3.5-122B-A10B | Apache 2.0 | 1x H100 80GB |
| Best reasoning on single H100 | DS-R1-Distill-Llama-70B | MIT | 1x H100 80GB |
| Best RTX 4090 general model | Qwen3.5-27B | Apache 2.0 | 1x RTX 4090 |
| Best RTX 4090 reasoning | DS-R1-Distill-Qwen-32B | MIT | 1x RTX 4090 |
| RTX 3090 / budget GPU | Qwen3.5-9B | Apache 2.0 | 1x RTX 3090 |
| Minimal hardware / RTX 3060 | Qwen3.5-4B | Apache 2.0 | 1x RTX 3060 |

Connecting Self-Hosted LLMs to Your Company's Data

Self-hosting solves the inference problem, not the retrieval and permissions problem. Most teams still need a way to connect the model to internal knowledge sources and keep access controls intact.

Onyx is relevant here as the application layer around a self-hosted model. You can point Onyx at a vLLM or Ollama deployment, connect sources like Slack, Confluence, Jira, and GitHub, and keep permission-aware search and chat in front of users. That makes the hardware choices in this guide more actionable: the model handles inference, while Onyx handles retrieval, orchestration, and team-facing access.


Frequently Asked Questions

What is the best self-hosted LLM in 2026?

In this leaderboard snapshot, the best self-hosted LLM depends mostly on hardware tier. On enterprise GPU clusters, Kimi K2.5 leads HumanEval, GLM-4.7 is the practical 4x H100 option, and GLM-5 has the strongest SWE-bench result among the cluster-tier models (MiniMax M2.5 posts the guide's highest SWE-bench score at the dual-GPU tier). On a single H100, Qwen3.5-122B-A10B is the top pick, with DeepSeek V3 as an alternative for teams comfortable with heavy quantization. On an RTX 4090, Qwen3.5-27B is the general-purpose pick and DS-R1-Distill-Qwen-32B is the reasoning-focused option. For budget hardware, Qwen3.5-9B runs on an RTX 3090 and Qwen3.5-4B runs on an RTX 3060 or smaller.

Can I run a frontier-class LLM on a single GPU?

Yes. GPT-oss 120B achieves 62.4% SWE-bench on a single H100 80GB. On an RTX 4090, DS-R1-Distill-Qwen-32B delivers 72% AIME and 62.1% GPQA Diamond, capabilities that would have required a multi-GPU cluster two years ago. The DeepSeek R1 distilled models represent the most significant advance in single-GPU performance.

What tools do I use to run these models locally?

The most common inference runtimes are Ollama (easiest setup for consumer hardware), vLLM (best throughput for production deployments), LM Studio (GUI-based for Mac/Windows), and llama.cpp (CPU and low-VRAM support). Most models in this guide are available through Ollama or HuggingFace directly.
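Ollama, vLLM, and LM Studio can all expose an OpenAI-compatible HTTP API, so a single small client covers whichever runtime you pick. Below is a minimal sketch using only the Python standard library; the base URL assumes Ollama's default port (vLLM typically serves on port 8000), and the model tag is a hypothetical placeholder to replace with whatever your runtime actually reports.

```python
import json
from urllib import request

def build_payload(prompt: str, model: str) -> dict:
    """Chat-completions request body shared by OpenAI-compatible servers."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def chat(prompt: str,
         model: str = "qwen3.5:27b",  # placeholder tag; list your runtime's models first
         base_url: str = "http://localhost:11434/v1") -> str:
    """POST to a local OpenAI-compatible endpoint and return the reply text."""
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the request shape is identical across these runtimes, switching from an Ollama experiment to a production vLLM deployment is usually just a change of `base_url` and model name.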

Which self-hosted LLMs support air-gapped deployment?

All models in this guide can run in fully air-gapped environments after the initial model weight download. MIT and Apache 2.0 licensed models have no phone-home requirements or usage telemetry by default. For regulated environments (ITAR, HIPAA, FedRAMP), open-weight models on private infrastructure provide a compliant self-hosting path.

How do I estimate self-hosting cost versus API costs?

A single H100 80GB server node (cloud or on-premise) runs approximately $2-4/hour on cloud or $15K-$25K to purchase. To compare against an API, multiply your daily token volume (in millions) by the per-million-token API price you would otherwise pay, subtract power and hosting costs, and divide the hardware cost by that daily saving to get a break-even point. For teams with consistent 24/7 inference demand, self-hosting is typically 5-10x cheaper over a 2-year horizon. See Best Open Source LLMs 2026 for a full comparison of open-source model licenses, or Best LLMs for Coding 2026 for coding-focused rankings including API options.
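The break-even arithmetic can be sketched as follows; the inputs in the example call are illustrative placeholders, not leaderboard data, so substitute your own hardware price, token volume, and API rate.

```python
def breakeven_days(hardware_cost_usd: float,
                   tokens_per_day_millions: float,
                   api_price_per_million_usd: float,
                   daily_opex_usd: float = 0.0) -> float:
    """Days until owned hardware beats paying per-token API prices.

    daily_opex_usd covers power/hosting for the self-hosted node.
    """
    daily_savings = tokens_per_day_millions * api_price_per_million_usd - daily_opex_usd
    if daily_savings <= 0:
        return float("inf")  # at this volume, the API stays cheaper
    return hardware_cost_usd / daily_savings

# Illustrative only: $20K node, 50M tokens/day, $2.00/M blended API price, $10/day opex.
print(round(breakeven_days(20_000, 50, 2.00, 10.0)))  # → 222
```

The main sensitivity is token volume: at low utilization the break-even point recedes toward infinity, which is why the 24/7-inference case is where self-hosting pays off.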

What is the best model for my specific hardware?

It depends on your GPU's VRAM and whether you are using FP16 or INT4 quantization. The Onyx LLM Hardware Requirements Calculator lets you enter your hardware and instantly see which models fit, including VRAM requirements and recommended quantization settings for each model in this guide.