By Roshan Desai
Self-hosting an LLM used to mean accepting significantly worse performance than the proprietary APIs. In 2026, that has changed. The best self-hosted models now track the proprietary APIs closely on coding benchmarks, but they demand serious hardware. The question is no longer just capability: it is whether the economics make sense for your team, and which model fits your GPU setup.
This guide covers the top self-hosted LLMs from the Onyx Self-Hosted LLM Leaderboard, organized by the hardware tier they require. Data is current as of March 12, 2026.
How this guide is sourced: Hardware, benchmark, and license data comes from the Onyx Self-Hosted LLM Leaderboard. The recommendations in each section interpret that data for specific deployment tiers.
TL;DR: The right self-hosted model depends almost entirely on what GPU you have. GLM-4.7 is the entry point for cluster deployments at 4x H100 (320GB), Kimi K2.5 needs 4x H200 (564GB) but leads on code generation, and GLM-5 needs 4x H200 (564GB) and posts the strongest bug-fixing performance in its tier. A single H100 gets you Qwen3.5-122B-A10B, a 122B MoE model that runs efficiently via only 10B active parameters, or DeepSeek V3 with aggressive quantization. On an RTX 4090, Qwen3.5-27B is the general-purpose pick and DS-R1-Distill-Qwen-32B is the reasoning pick. For budget GPUs, Qwen3.5-9B runs on an RTX 3090 and Qwen3.5-4B runs on an RTX 3060 or smaller.
A self-hosted LLM is a model you run on hardware you control rather than calling via an API. Your data stays on your infrastructure, you don't pay per token, and you can customize the model for your specific use case.
Reasons teams choose to self-host:

- Data control: prompts and documents never leave your infrastructure.
- Cost: no per-token billing; spend is bounded by hardware, power, and operations.
- Customization: you can quantize, fine-tune, or swap models to fit your use case.

The hardware you can dedicate determines which models are realistic:
| Tier | Hardware | Total VRAM | Approximate Cost | Best For |
|---|---|---|---|---|
| 4x H200 cluster | 4x H200 141GB | 564GB | $150K-$250K | Large MoE models up to ~750B |
| 4x H100 cluster | 4x H100 80GB | 320GB | $60K-$100K | Large MoE models |
| Dual H100 | 2x H100 80GB | 160GB | $30K-$50K | 200B+ MoE, 120B dense |
| Single H100 | 1x H100 80GB | 80GB | $15K-$25K | 70B-120B models |
| RTX 4090 | 1x RTX 4090 24GB | 24GB | $1,500-$2,000 | 27B-32B models in INT4 |
| RTX 3090 | 1x RTX 3090 24GB | 24GB | $400-$700 | 7B-14B models |
| RTX 3060 | 1x RTX 3060 12GB | 12GB | $250-$400 | 4B-7B models |
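The VRAM figures in this table follow from simple arithmetic: model weights need roughly total parameters times bytes per parameter, plus headroom for the KV cache and activations. Note that MoE models still need VRAM for all experts, so the total (not active) parameter count drives memory. A rough sketch; the 20% overhead factor is an assumption, and real usage varies with context length and batch size:

```python
def estimate_vram_gb(total_params_b: float, bits_per_param: int,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB for a model with total_params_b billion
    parameters. Weights take bits_per_param/8 bytes each; the overhead
    factor (an assumed ~20%) covers KV cache and activations."""
    weight_gb = total_params_b * bits_per_param / 8  # billions of params -> GB
    return weight_gb * overhead

# A 27B dense model in INT4 (4 bits per weight):
print(round(estimate_vram_gb(27, 4), 1))   # 16.2 GB: fits a 24GB RTX 4090
# The same model unquantized in FP16 (16 bits per weight):
print(round(estimate_vram_gb(27, 16), 1))  # 64.8 GB: needs an H100
```

This is why the 24GB RTX 4090 tier tops out around 27B-32B dense models in INT4, and why a 122B MoE model needs a full H100 even though only 10B parameters are active per token.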
- 4x H200 (564GB): Kimi K2.5, GLM-5
- 4x H100 (320GB): GLM-4.7
- Dual GPU (2x H100 / 2x A100): MiniMax M2.5, Qwen3-235B-A22B, Step-3.5-Flash
- Single H100: Qwen3.5-122B-A10B, DeepSeek V3, GPT-oss 120B
- RTX 4090: Qwen3.5-27B, DS-R1-Distill-Qwen-32B
- RTX 3090: Qwen3.5-9B
- RTX 3060 and smaller: Qwen3.5-4B
These models require multi-GPU inference but deliver performance approaching the proprietary frontier models. Hardware requirements vary within this tier: GLM-4.7 fits on 4x H100 80GB (320GB total), while Kimi K2.5 and GLM-5 need 4x H200 141GB (564GB total). Recommended for enterprise teams that already have GPU clusters or are willing to invest in the hardware to avoid API costs long-term.
| Model | License | Params (Total/Active) | SWE-bench | HumanEval | GPQA | AIME |
|---|---|---|---|---|---|---|
| GLM-5 | MIT | 744B / 40B | 77.8% | 90.0% | 86.0% | 84.0% |
| Kimi K2.5 | MIT | 1T / 32B | 76.8% | 99.0% | 87.6% | 96.1% |
| GLM-4.7 | MIT | 355B / 32B | 73.8% | 94.2% | 85.7% | 95.7% |
GLM-4.7 (4x H100 80GB, 320GB total): MIT-licensed, 355B / 32B active parameters. The most accessible server-tier model — if you already have H100 infrastructure, this is the practical starting point with the best balance of capability and deployability.
Kimi K2.5 (4x H200 141GB, 564GB total): MIT-licensed, 1T / 32B active parameters. Leads this entire list on code generation. Choose it when your cluster can support the H200 requirement and code quality is the top priority.
GLM-5 (4x H200 141GB, 564GB total): MIT-licensed, 744B / 40B active parameters. Leads on SWE-bench — choose it when software-engineering performance matters more than hardware efficiency.
| Model | License | Params | SWE-bench | HumanEval | GPQA | AIME |
|---|---|---|---|---|---|---|
| MiniMax M2.5 | Apache 2.0 | 230B / 10B | 80.2% | 89.6% | 85.2% | 86.3% |
| Qwen3-235B-A22B | Apache 2.0 | 235B / 22B | N/A | N/A | 71.1% | 81.5% |
| Step-3.5-Flash | Apache 2.0 | 196B / 11B | 74.4% | 81.1% | N/A | 99.8% |
| Devstral-2-123B | Modified MIT | 123B / 123B | 72.2% | N/A | N/A | N/A |
| Qwen3-Coder-Next | Apache 2.0 | 80B / 3B | 70.6% | 94.1% | 53.4% | 89.2% |
MiniMax M2.5: Apache 2.0, 230B / 10B active parameters. At 80.2% SWE-bench it matches the top proprietary models on bug-fixing, with MoE keeping inference efficient. The strongest all-round pick at this tier.
Qwen3-235B-A22B: Apache 2.0, 235B / 22B active parameters. A reasoning-focused model, particularly strong on science and math. A good fit when you need a large capable model without evaluating proprietary licensing.
Step-3.5-Flash: Apache 2.0, 196B / 11B active parameters, 74.4% SWE-bench and 99.8% AIME. The standout choice when your workload is heavy on numerical reasoning, financial analysis, or algorithmic tasks.
Devstral-2-123B: Modified MIT, 123B dense parameters, 72.2% SWE-bench. A coding-specialist option for teams that want a dense model optimized specifically for software engineering.
Qwen3-Coder-Next: Apache 2.0, 80B / 3B active parameters, 70.6% SWE-bench and 94.1% HumanEval. A coding-focused MoE whose 3B active parameters make it the efficiency pick when you want strong code generation at minimal inference cost.
Single H100 deployments are the most common enterprise self-hosting configuration. At $15K-$25K for a used or leased H100, these models offer strong cost-performance for teams running steady inference workloads.
| Model | License | Params | SWE-bench | HumanEval | GPQA | AIME |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B | Apache 2.0 | 122B / 10B | N/A | N/A | N/A | N/A |
| DeepSeek V3 | Non-standard | 671B / 37B | 38.8% | N/A | 68.4% | N/A |
| GPT-oss 120B | Apache 2.0 | 117B / 5.1B | 62.4% | 88.3% | 80.9% | 97.9% |
| DS-R1-Distill-Llama-70B | MIT | 70B / 70B | N/A | 86.0% | 65.2% | 70.0% |
Qwen3.5-122B-A10B: Apache 2.0, 122B total / 10B active parameters via MoE. The top pick for single H100 deployments — you get a 122B-parameter model's knowledge at the inference cost of a 10B model.
DeepSeek V3: Public weights, non-standard license, 671B / 37B active parameters. Fits on a single H100 with aggressive INT4 quantization. A strong option if you are comfortable with the setup requirements — review the license terms before production deployment.
GPT-oss 120B: Apache 2.0, 117B / 5.1B active parameters, 62.4% SWE-bench, 88.3% HumanEval, 80.9% GPQA, and 97.9% AIME. A reliable fallback with strong benchmark data for teams that want proven numbers before committing to newer models.
DS-R1-Distill-Llama-70B: MIT, 70B dense parameters, 86.0% HumanEval, 65.2% GPQA, and 70.0% AIME. Distilled from DeepSeek R1, it is the reasoning-focused pick at this tier.
The RTX 4090 has become the benchmark consumer GPU for LLM enthusiasts and small teams. At 24GB VRAM, it runs 27B-32B dense models in INT4 quantization comfortably.
| Model | License | Params | Key Strength |
|---|---|---|---|
| Qwen3.5-27B | Apache 2.0 | 27B | Best general-purpose |
| DS-R1-Distill-Qwen-32B | MIT | 32B | Strong reasoning |
Qwen3.5-27B: Apache 2.0, 27B parameters, runs on a single RTX 4090 in INT4. The default pick for RTX 4090 owners — latest generation Qwen with broad capability across coding, reasoning, and everyday tasks.
DS-R1-Distill-Qwen-32B: MIT, 32B parameters, 85.4% HumanEval, 62.1% GPQA, and 72.0% AIME. Distilled from DeepSeek R1, it punches above its weight on math and logic tasks — the better choice when reasoning depth is the priority.
The most accessible tier for individual developers and small teams. The Qwen3.5 small model family is best-in-class here, offering strong coding and reasoning performance on consumer hardware that costs $400-$700.
| Model | License | Params | VRAM (INT4) | Runs on |
|---|---|---|---|---|
| Qwen3.5-9B | Apache 2.0 | 9B | 5GB | RTX 3090+ |
| Qwen3.5-4B | Apache 2.0 | 4B | 2GB | RTX 3060+ |
Qwen3.5-9B: Apache 2.0, 9B parameters, only 5GB VRAM at INT4. The best daily-driver for RTX 3090 owners — capable across coding assistance, document Q&A, and reasoning tasks.
Qwen3.5-4B: Apache 2.0, 4B parameters, 2GB VRAM at INT4. The right pick when hardware is the hard constraint — runs on an RTX 3060 or smaller and delivers surprisingly capable performance for its size.
| Your Situation | Best Model | License | Hardware |
|---|---|---|---|
| Frontier performance, enterprise cluster | Kimi K2.5 | MIT | 4x H200 141GB |
| Best SWE-bench on available H100 cluster | GLM-4.7 | MIT | 4x H100 80GB |
| Best SWE-bench, 4x H200 cluster | GLM-5 | MIT | 4x H200 141GB |
| Best bug-fixing at dual-GPU | MiniMax M2.5 | Apache 2.0 | 2x H100 80GB |
| Best math at dual-GPU | Step-3.5-Flash | Apache 2.0 | 2x H100 80GB |
| Best single H100 model | Qwen3.5-122B-A10B | Apache 2.0 | 1x H100 80GB |
| Best reasoning on single H100 | DS-R1-Distill-Llama-70B | MIT | 1x H100 80GB |
| Best RTX 4090 general model | Qwen3.5-27B | Apache 2.0 | 1x RTX 4090 |
| Best RTX 4090 reasoning | DS-R1-Distill-Qwen-32B | MIT | 1x RTX 4090 |
| RTX 3090 / budget GPU | Qwen3.5-9B | Apache 2.0 | 1x RTX 3090 |
| Minimal hardware / RTX 3060 | Qwen3.5-4B | Apache 2.0 | 1x RTX 3060 |
Self-hosting solves the inference problem, not the retrieval and permissions problem. Most teams still need a way to connect the model to internal knowledge sources and keep access controls intact.
Onyx is relevant here as the application layer around a self-hosted model. You can point Onyx at a vLLM or Ollama deployment, connect sources like Slack, Confluence, Jira, and GitHub, and keep permission-aware search and chat in front of users. That makes the hardware choices in this guide more actionable: the model handles inference, while Onyx handles retrieval, orchestration, and team-facing access.
What is the best self-hosted LLM in 2026?
In this leaderboard snapshot, the best self-hosted LLM depends mostly on hardware tier. On enterprise GPU clusters, Kimi K2.5 leads HumanEval, GLM-4.7 is the practical 4x H100 option, and GLM-5 posts the strongest SWE-bench result among the 4x H200 models; at dual-GPU scale, MiniMax M2.5 has the highest SWE-bench score in this guide at 80.2%. On a single H100, Qwen3.5-122B-A10B is the top pick, with DeepSeek V3 as an alternative for teams comfortable with heavy quantization. On an RTX 4090, Qwen3.5-27B is the general-purpose pick and DS-R1-Distill-Qwen-32B is the reasoning-focused option. For budget hardware, Qwen3.5-9B runs on an RTX 3090 and Qwen3.5-4B runs on an RTX 3060 or smaller.
Can I run a frontier-class LLM on a single GPU?
Yes. GPT-oss 120B achieves 62.4% SWE-bench on a single H100 80GB. On an RTX 4090, DS-R1-Distill-Qwen-32B delivers 72% AIME and 62.1% GPQA Diamond, capabilities that would have required a multi-GPU cluster two years ago. The DeepSeek R1 distilled models are among the most significant recent advances in single-GPU performance.
What tools do I use to run these models locally?
The most common inference runtimes are Ollama (easiest setup for consumer hardware), vLLM (best throughput for production deployments), LM Studio (GUI-based for Mac/Windows), and llama.cpp (CPU and low-VRAM support). Most models in this guide are available through Ollama or HuggingFace directly.
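One practical consequence of this runtime landscape: vLLM serves an OpenAI-compatible HTTP API, and Ollama offers a compatible endpoint as well, so application code can stay the same whichever runtime you pick. A minimal sketch using only the standard library; the host, port, model name, and the `build_chat_request` helper are illustrative assumptions, not part of either tool:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Assemble a POST to an OpenAI-compatible /v1/chat/completions endpoint.
    (Hypothetical helper; base_url/model depend on your deployment.)"""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8000", "qwen3.5-27b", "Explain MoE routing.")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# With a server running, you would send it with:
#   json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"]
```

Because the wire format matches the OpenAI API, swapping a self-hosted model in behind existing tooling is usually a one-line base-URL change.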
Which self-hosted LLMs support air-gapped deployment?
All models in this guide can run in fully air-gapped environments after the initial model weight download. MIT and Apache 2.0 licensed models have no phone-home requirements or usage telemetry by default. For regulated environments (ITAR, HIPAA, FedRAMP), open-weight models on private infrastructure provide a practical path to meeting data-residency requirements.
How do I estimate self-hosting cost versus API costs?
A single H100 80GB server node runs approximately $2-4/hour on cloud or $15K-$25K to purchase. GPT-oss 120B on that node can serve roughly 1-5 million tokens per day depending on workload. How quickly you break even depends on volume and on the API you are replacing: at budget prices like DeepSeek's $0.28/M input tokens, the API is hard to beat, but against premium APIs charging several dollars per million tokens, a heavily utilized node can pay for itself in roughly two years. For teams with consistent 24/7 inference demand, self-hosting is typically the cheaper option over a multi-year horizon. See Best Open Source LLMs 2026 for a full comparison of open-source model licenses, or Best LLMs for Coding 2026 for coding-focused rankings including API options.
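The break-even arithmetic can be sketched directly. The $20K hardware cost and $5/M blended API price below are hypothetical inputs for illustration, not quoted figures:

```python
def breakeven_days(hardware_cost: float, tokens_per_day_m: float,
                   api_price_per_m: float) -> float:
    """Days until cumulative API spend at the same volume matches the
    hardware outlay. Ignores power and ops costs, so the true break-even
    comes somewhat later than this estimate."""
    daily_api_cost = tokens_per_day_m * api_price_per_m  # dollars per day
    return hardware_cost / daily_api_cost

# $20K node serving 5M tokens/day, versus a hypothetical $5/M blended API price:
print(round(breakeven_days(20_000, 5.0, 5.00)))   # 800 days
# Doubling the volume halves the break-even:
print(round(breakeven_days(20_000, 10.0, 5.00)))  # 400 days
```

The sensitivity to `api_price_per_m` is the key takeaway: against cheap APIs the break-even stretches out by years, so the case for self-hosting rests on volume, the price of the API being displaced, and non-cost factors like data control.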
What is the best model for my specific hardware?
It depends on your GPU's VRAM and whether you are using FP16 or INT4 quantization. The Onyx LLM Hardware Requirements Calculator lets you enter your hardware and instantly see which models fit, including VRAM requirements and recommended quantization settings for each model in this guide.