Mar 2, 2026
Roshan Desai
TL;DR: Running a local LLM is the easy part. Deploying AI for your team requires a full stack: model serving, a chat interface, knowledge retrieval (RAG), connectors to your data, and authentication. This guide maps every layer, compares the DIY approach to integrated platforms like Onyx, and provides recommended configurations for teams of every size.
There's no shortage of tutorials for running an LLM locally. Install Ollama, pull a model, start chatting in your terminal. You can do it in five minutes.
But there's a chasm between "I can run a model on my laptop" and "my team of 50 people can use AI, grounded in our company's knowledge, with proper access controls." That chasm is the self-hosted LLM stack.
When an individual's LLM experiment becomes a team's AI deployment, four more layers suddenly matter: a chat interface, knowledge retrieval, data connectors, and access controls.
This guide maps the complete self-hosted LLM stack for teams and compares your options at each layer.

The foundation of any self-hosted LLM deployment. This layer runs the model on your hardware and exposes an API endpoint.
A good inference engine exposes an OpenAI-compatible API so your platform layer works regardless of which engine you pick. It should support your GPU hardware, handle quantization so you can run large models on smaller cards, and let you swap models without restarting the whole service.
A great inference engine also scales gracefully under concurrent load through continuous batching, so 50 users hitting it at once don't degrade response times for everyone.
Ollama — A popular entry point. One command downloads and runs models. Exposes an OpenAI-compatible API. Bundles llama.cpp under the hood and handles quantization automatically.
vLLM — Built for production inference. Uses PagedAttention to virtually eliminate KV-cache memory waste (from 60-80% in traditional systems to under 4%) and increase throughput by 2-4x compared to prior serving systems. Benchmarks show 793 TPS peak vs. Ollama's 41 TPS on the same hardware (Llama 3.1 8B, A100 GPU).
SGLang — High-performance serving framework from LMSYS (the team behind Chatbot Arena). RadixAttention engine caches repeated prompt prefixes in the KV-cache, making it especially fast for chatbot-style workloads where users share system prompts.
LM Studio — Desktop application with a graphical interface. Users can browse, download, and run models without touching a terminal. Supports Vulkan offloading for machines without dedicated GPUs.
| Feature | Ollama | vLLM | SGLang | LM Studio |
|---|---|---|---|---|
| Setup difficulty | Easy (one command) | Moderate (Python, CUDA) | Moderate (Python, CUDA) | Easy (desktop app) |
| Throughput (peak) | ~41 TPS | ~793 TPS | ~29% faster than vLLM | N/A (desktop use) |
| Concurrent users | Good for <20 | Designed for 20-1000+ | Designed for 20-1000+ | Single user |
| Model hot-swap | Yes (on-the-fly) | No (requires restart) | No (requires restart) | Yes (GUI selection) |
| GPU support | NVIDIA, AMD, Apple Silicon | NVIDIA, AMD, Intel, TPU | NVIDIA, AMD, TPU | NVIDIA, AMD, Apple Silicon, Vulkan |
| API compatibility | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible |
| Quantization | Automatic | Manual/native | Manual/native | Automatic |
| Open-source | Yes | Yes | Yes | No (proprietary, free to use) |
| Best for | Getting started, small teams | Production, high concurrency | Max throughput, structured output | Exploration, demos |
Use Ollama to start. Switch to vLLM or SGLang when you need production performance at scale. SGLang tends to outperform vLLM on chat workloads and structured output, while vLLM has the more mature ecosystem and broader community support. All three expose OpenAI-compatible APIs, so the platform layer (Onyx, OpenWebUI, LibreChat) works with any of them without changes.
A model API alone isn't user-friendly. This layer provides the web-based chat experience your team interacts with.
A good chat interface gives each user their own conversation history, lets them upload documents for ad-hoc context, and connects to multiple backend models.
A great one goes further: shared conversations so teams can collaborate on research; prompt templates and reusable agents that non-technical users can build and share; built-in web search, deep research, and tool use; and admin controls for managing users and models from a single dashboard.
Onyx is a full-featured AI platform that includes chat, agents, web search, image generation, deep research, and MCP support. For larger organizations, Onyx goes even further, with 40+ native connectors to data sources like Drive, Slack, and Confluence, SSO with SCIM provisioning to maintain RBAC, and even a Slack bot to meet your team where they work.
OpenWebUI is one of the most popular self-hosted chat interfaces, with 125K+ GitHub stars. It offers a polished, ChatGPT-style UX with conversation history, model selection, and RBAC. Built-in document upload RAG lets users attach files for context, though retrieval reliability is a common pain point. A pipeline architecture allows community-built plugins for custom functionality, and it supports Ollama and any OpenAI-compatible backend out of the box.
LibreChat is a multi-provider chat interface that connects to OpenAI, Anthropic, Google, Azure, and local models through a single UI. It stands out for its MCP-based agent support, built-in code interpreter, and the most mature authentication stack of any open-source chat UI (OAuth, SAML, LDAP, 2FA). LibreChat was acquired by ClickHouse in late 2025, signaling enterprise ambitions.
The difference: OpenWebUI and LibreChat are pure chat interfaces that sit on top of your inference engine. Onyx is a complete platform where you can customize, collaborate with, and extend your AI agents, all in a single deployment.
This is where DIY stacks get complicated. Running a model and providing a chat UI is straightforward. Connecting that model to your company's actual knowledge is hard to get right.
A good knowledge layer connects to the tools your team already uses and keeps that data indexed and up to date.
A great one does it without any custom code: native connectors that sync in real time, permission inheritance that respects who can see what in the source system, hybrid search that combines keyword and semantic retrieval so results are actually accurate, and cited sources in every response so users can verify and trust the answers.
This layer is the hardest to build yourself and the easiest to underestimate.
The RAG pipeline: to ground AI responses in your data, you need connectors to pull content, a chunking and embedding layer, a vector database, and retrieval logic that finds the right context at query time.
The DIY approach: OpenWebUI includes built-in RAG with document upload, an embedded vector database (ChromaDB), and hybrid search. On paper, this covers the basics. In practice, users widely report unreliable retrieval quality, broken hybrid search across versions, ChromaDB scaling issues under concurrent load, and poor default chunking. Many teams end up building custom RAG pipelines anyway. For live enterprise data from tools like Slack, Confluence, or Jira, OpenWebUI has no native connectors. You would need to build custom ETL pipelines or configure community MCP servers, with no permission inheritance from source systems.
The integrated approach: Onyx handles the entire RAG pipeline by default. Its 40+ native connectors pull data from enterprise tools, sync in real time, and index content automatically. Hybrid search (keyword + semantic), contextual retrieval, and LLM-based knowledge graphs deliver accurate, cited answers without custom engineering. And Onyx's custom agent harness means complex questions are tackled by a team of AI agents that divide and conquer, rather than by the single search attempt most other RAG applications make.
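To make "retrieval logic" concrete, here is a toy sketch of the chunk-embed-rank loop a DIY pipeline has to implement. A bag-of-words counter stands in for a real embedding model, and the documents are invented; production systems use neural embeddings and a vector database:

```python
import math
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real pipelines use a neural model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented documents for illustration only.
docs = [
    "Refunds are issued within 14 days of purchase for annual plans.",
    "The Q3 roadmap prioritizes the new connector framework.",
]
index = [(d, embed(c)) for d in docs for c in chunk(d)]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank indexed chunks against the query; return the top-k source docs."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

Every piece here (chunking strategy, embedding quality, ranking) is a tuning surface, which is why retrieval quality is where DIY stacks spend most of their engineering time.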
The layer that separates personal tools from organizational platforms.
A good enterprise layer handles SSO so users log in with existing credentials and RBAC so admins control who sees what.
A great one also inherits permissions from source systems automatically, provides usage analytics for governance, maintains a full audit trail for compliance, and supports white-labeling so the platform feels like an internal tool rather than a third-party product.

Some teams build their own stack by combining individual components. Here's what that looks like:
Typical DIY stack: Ollama or vLLM for inference, OpenWebUI or LibreChat for the chat interface, a custom RAG pipeline (embedding model, vector database, chunking logic), hand-built connectors or ETL jobs for enterprise data, and authentication handled by the chat UI or a reverse proxy.
Advantages: full control over every component, no per-seat licensing, and the freedom to customize anything.
Disadvantages: six to eight separate components to deploy, upgrade, and debug; no native enterprise connectors; permission inheritance you must build yourself; and retrieval quality tuning that falls entirely on your team.
Realistic assessment: The DIY approach works well for indie hackers and small teams that want customization. It doesn't work well for organizations that want to deploy AI for end users quickly, or for teams without dedicated DevOps/MLOps resources. Even with these resources, it can become a headache to juggle all of the different applications and vendors on top of the development and maintenance.
Onyx takes a different approach. Instead of requiring teams to assemble and maintain a multi-component stack, it provides Layers 2 through 4 in a single deployment, then connects to any Layer 1 inference engine.
What Onyx handles (Layers 2-4): the chat interface and agents, the full RAG pipeline with 40+ native connectors and permission inheritance, and enterprise controls such as SSO, RBAC, and audit trails.
What you bring: a Layer 1 inference engine (any OpenAI-compatible endpoint such as Ollama, vLLM, or SGLang) and/or cloud API keys for frontier models.
Deployment: Docker Compose for small to mid-size teams, Kubernetes (Helm chart) for large enterprise deployments. Initial setup to first query takes under an hour.
Pricing: The community edition is free and fully functional (MIT license). Certain features, only needed in large deployments (SSO, RBAC, white-labeling, dedicated support), require the enterprise plan.
One of the primary motivations for self-hosted LLMs is cost control. Here's how the math works:
| Provider | Model | Cost (per 1M tokens) |
|---|---|---|
| OpenAI | GPT-4o | ~$2.50 input / $10 output |
| Anthropic | Claude Sonnet 4 | ~$3 input / $15 output |
| Google | Gemini 2.5 Pro | ~$1.25 input / $10 output |
| DeepSeek | DeepSeek-V3.2 | ~$0.28 input / $0.42 output |
For a team of 100 users averaging 50 queries/day with ~2,000 tokens per query, monthly API costs range from $1,000-$5,000 depending on the provider and model.
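To make the arithmetic behind that range explicit (assuming ~30 active days per month and an even input/output token split, both simplifications):

```python
# Team profile from above: 100 users, 50 queries/day, ~2,000 tokens/query.
users, queries_per_day, tokens_per_query, days = 100, 50, 2_000, 30
monthly_tokens = users * queries_per_day * tokens_per_query * days  # 300M tokens

def monthly_cost(input_rate: float, output_rate: float,
                 input_share: float = 0.5) -> float:
    """Blend per-1M-token input/output rates, then scale to monthly volume."""
    blended = input_rate * input_share + output_rate * (1 - input_share)
    return monthly_tokens / 1e6 * blended

print(monthly_cost(2.50, 10.00))  # GPT-4o rates from the table -> 1875.0
print(monthly_cost(3.00, 15.00))  # Claude Sonnet 4 rates       -> 2700.0
```

Shift the input/output ratio or the active-day count and the totals move accordingly, which is why real bills land in a range rather than on a single number.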
| Setup | Hardware | Approximate Cost | Capacity |
|---|---|---|---|
| Small team (5-20 users) | 1x RTX 4090/5090 (24-32GB) or Mac Studio (M4 Max or newer) | $2,000-$4,000 one-time | 27B-35B models (quantized), adequate for light use |
| Mid-size team (20-200 users) | 1x NVIDIA A100 (80GB) or 2x RTX 4090/5090 | $9,000-$15,000 one-time | 70B-122B MoE models (quantized), moderate concurrent use |
| Large team (200-1000+ users) | 4-8x NVIDIA A100/H100 or cloud GPU instances | $40,000-$200,000 one-time (or $5,000-$12,000/month cloud GPU) | 400B+ MoE models with high concurrency |
Many teams use a hybrid strategy: self-hosted models for routine queries (cost-effective) and cloud APIs for complex tasks requiring frontier models (higher quality). Onyx supports this natively: you can configure different models for different use cases, routing simple queries to a local Qwen 3.5 model and complex research tasks to Opus 4.6 or GPT-5.2.
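A sketch of what that per-use-case routing amounts to; the URLs and model identifiers here are placeholders, and in Onyx this mapping is configured per agent in the admin UI rather than in code:

```python
# Illustrative routing table: cheap local model for routine work, frontier
# API for research. All endpoint URLs and model names are assumptions.
ROUTES = {
    "quick_answer": {"base_url": "http://localhost:8000/v1",   # local vLLM
                     "model": "qwen3.5"},
    "deep_research": {"base_url": "https://api.example.com/v1", # cloud API
                      "model": "frontier-model"},
}

def route(use_case: str) -> dict:
    """Unknown use cases fall back to the near-zero-cost local model."""
    return ROUTES.get(use_case, ROUTES["quick_answer"])
```

The economics follow directly: the bulk of query volume lands on the local model, so the cloud bill only reflects the minority of tasks that genuinely need frontier quality.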
For a 100-person team, self-hosting typically breaks even within 6-12 months compared to cloud APIs, assuming moderate usage. The bigger the team and the heavier the usage, the faster self-hosting pays for itself.
The platform layer cost also matters: Onyx's community edition is free, so the only incremental costs are infrastructure. The Cloud Business plan starts at $20/user/month (annual billing), and custom pricing with volume-based discounts is available for enterprises. Onyx Enterprise Edition is still substantially less than ChatGPT Enterprise ($60+/seat/month), with the added benefit of data connectors, enterprise search, and full self-hosting.

Not every team needs to run models on their own hardware. Managed hosted LLM providers run open-weight models (Qwen, Mistral, DeepSeek) on dedicated infrastructure in their cloud, giving you API access without managing GPUs yourself. You get stronger models on better hardware than most teams can provision internally, plus compliance guarantees like SOC 2 certification, BAAs for healthcare data, and dedicated instances that don't share resources with other customers.
The tradeoff is straightforward: your data leaves your network. These providers offer strong contractual and technical safeguards, but if your threat model requires a fully air-gapped deployment, this isn't it.
Because the relevant providers (as of early 2026) expose OpenAI-compatible APIs, Onyx works with all of them out of the box. You configure the endpoint and API key, and Onyx handles the rest: routing queries, managing context windows, and orchestrating RAG pipelines against your connected knowledge sources.
Profile: Startup or small engineering team wanting private AI chat with some internal knowledge access.
Recommended stack: Ollama on a single RTX 4090/5090 workstation or Mac Studio, Onyx Community Edition deployed via Docker Compose, with optional cloud API keys for tasks that need a frontier model.
Setup time: ~1 hour. Monthly cost: $0 (platform) + existing hardware + optional cloud API budget.
Profile: Growing company that needs AI grounded in company knowledge with proper access controls across departments.
Recommended stack: vLLM or SGLang on a dedicated GPU server (one A100 or a pair of RTX 4090/5090s), Onyx Enterprise via Docker Compose with SSO configured, and native connectors to your core knowledge sources.
Setup time: 2-4 hours. Monthly cost: Onyx Enterprise (contact sales for pricing) + infrastructure + optional cloud API budget.
Profile: Large organization with compliance requirements, multiple departments, and data sensitivity across business units.
Recommended stack: vLLM or SGLang across a 4-8x A100/H100 cluster (or cloud GPU instances), Onyx Enterprise deployed via the Kubernetes Helm chart, with SSO/SCIM, audit trails, and permission-inherited connectors rolled out department by department.
Setup time: 1-2 days for initial deployment, 1-2 weeks for full rollout with connectors and user onboarding. Monthly cost: Onyx Enterprise (contact sales for pricing) + GPU infrastructure + optional cloud API budget.
The fastest path from "we want self-hosted AI for our team" to a working deployment:
1. Install Ollama on a machine with a GPU. Pull a model: `ollama pull qwen3.5`. Verify it works.
2. Deploy Onyx via Docker Compose. Point it at your Ollama instance. Connect your first data source (start with Slack or Google Drive, which deliver the most immediate value).
3. Invite your team. Configure authentication, set up roles, and let users start searching and chatting.
4. Iterate. Add more connectors, configure agents for specific workflows, deploy Slack/Teams bots for ambient AI access.
5. Scale. When concurrent usage outgrows Ollama, switch to vLLM or SGLang. When team size requires enterprise controls, upgrade to Onyx Enterprise.
The self-hosted LLM stack in 2026 is production-ready. Thousands of teams have gone from first install to full deployment, and the tooling has caught up to the ambition. The real decision isn't whether self-hosting works, it's how much of the stack you want to assemble yourself versus adopting an integrated platform. If you want the integrated path, get started with Onyx for free and have your first data source connected in under an hour.
Picking the right model for your self-hosted deployment matters as much as picking the right infrastructure. We maintain several live leaderboards to help, all updated regularly as new models and benchmarks are released.
Yes, and this is the path we recommend. All three engines expose OpenAI-compatible APIs, so the platform layer (Onyx, OpenWebUI, LibreChat) doesn't know or care which one is running underneath. The migration is a config change: deploy the new engine, update the endpoint URL, verify responses look right, decommission Ollama. Chat history, data sources, user accounts, and agent configs stay where they are because they live in the platform, not the engine.
Yes, and most production teams do. The typical setup: a local model like Qwen 3.5 handles routine questions (drafts, summaries, internal Q&A) at near-zero marginal cost, while frontier APIs like Claude or GPT-5.2 handle tasks where quality matters more than cost. Onyx lets you configure this per agent, so a "quick answer" bot uses the local model and a "deep research" agent routes to an API.
A chat interface gives your team a web UI to talk to a model. A platform connects that model to your actual company knowledge. The practical difference: with a chat interface, someone asks "what's our refund policy?" and gets a generic answer. With a platform that has your Confluence and support docs indexed, they get the actual policy with a link to the source document. If your team only needs a private ChatGPT, a chat interface is enough. If they need answers grounded in company data, you need the platform.
You'll need a RAG pipeline: connectors to pull content, a chunking and embedding layer, a vector database, and retrieval logic. Building this yourself is where most DIY projects stall, because each data source has its own API, auth model, and rate limits. Permission inheritance is even harder: making sure the AI doesn't surface Confluence pages that a user can't access in Confluence itself. Onyx provides 40+ native connectors that handle sync, indexing, and permissions automatically. If you want to build it yourself, expect the RAG pipeline to take more engineering time than the model serving.
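Permission inheritance in particular reduces to filtering before ranking: each indexed chunk carries the ACL it had in the source system, and retrieval drops anything the user couldn't open there. A toy illustration with invented documents and groups:

```python
# Each indexed document records which groups can see it in the source
# system (e.g., Confluence space permissions). Data here is invented.
INDEX = [
    {"text": "Q3 roadmap decisions from planning", "allowed": {"eng", "product"}},
    {"text": "Compensation bands by level",        "allowed": {"hr"}},
]

def visible(user_groups: set) -> list:
    """Return only documents the user could open in the source tool itself."""
    return [d["text"] for d in INDEX if d["allowed"] & user_groups]
```

The hard part in practice isn't this filter; it's keeping the `allowed` sets in sync with each source system's own permission model as it changes.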
For the initial deployment, a DevOps engineer who's comfortable with Docker and GPU drivers can get everything running. Most of the work is infrastructure, not ML. Where you start needing ML expertise: tuning retrieval quality (chunking strategies, embedding model selection, reranking), evaluating model outputs systematically, and debugging cases where the AI gives bad answers. If you use an integrated platform, much of that tuning is handled for you. A DIY stack with 6-8 components will keep an engineer busy.
Self-hosting is necessary but not sufficient. Keeping data on your infrastructure checks one box, but compliance auditors will also ask about audit logging, encryption at rest and in transit, data retention policies, access controls, and (for HIPAA) whether you have BAAs with every vendor in the chain. A common gap: teams self-host the LLM but use a cloud embedding API, which means data still leaves the network. Onyx Enterprise includes SOC 2 Type II certification, audit trails, and RBAC. For fully air-gapped environments (defense, healthcare with strict requirements), Onyx supports deployment with zero external network calls.
The biggest adoption killer isn't bad technology, it's asking people to change their workflow. A self-hosted AI that lives at a separate URL that people have to remember to visit will lose to ChatGPT every time. Put the AI where people already are: Slack bots, Teams integrations, browser extensions. Then make sure it can answer questions that generic ChatGPT can't, like "what did the team decide about the Q3 roadmap in last Tuesday's thread?" Once someone gets a useful answer grounded in real company context, they stop going back to ChatGPT on their own.