Open-source vs proprietary LLMs — when to pick which (Llama, Mistral, GPT, Claude)
The default in 2026: start with a hosted proprietary model (Azure OpenAI / OpenAI / Anthropic), move to open-source ONLY when you have a measurable, specific reason. Most teams over-index on open-source for ideological reasons and waste months on infrastructure.
The two camps
| Proprietary (hosted) | Open-source (self-host) | |
|---|---|---|
| Examples | GPT-4o, Claude 3.5, Gemini 1.5 | Llama 3.1, Mistral, Qwen, Phi, Gemma |
| Access | API call | Download weights + run on your GPUs |
| Latest quality | Day 1 — the bleeding edge | Often 6-12 months behind on benchmarks |
| Operational burden | Zero | Significant — GPUs, serving, scaling |
| Cost model | Per-token API | Fixed GPU + electricity |
| Privacy | Data leaves your tenant (mostly fine with Azure OpenAI) | Fully on-prem possible |
| Fine-tuning | Limited (OpenAI offers it, expensive) | Full control — LoRA, QLoRA, full SFT |
| Right for | Early stage, mainstream use cases | High volume, regulated data, specialized fine-tuning |
When proprietary clearly wins
1. You're early stage / pre-product-market-fit
You don't yet know which model size, which prompt strategy, or which fine-tune (if any) you need. Burning time on local inference infrastructure before you've shipped is premature optimization.
2. Latest capabilities matter
OpenAI / Anthropic / Google ship new features (multi-modal vision, structured output, function calling, computer use) months before open-source catches up. If your product hinges on the bleeding edge, hosted models are essential.
3. Operational team < 3 engineers
Running production LLM inference (KV-cache management, batching, autoscaling on H100 GPUs, monitoring, model updates) is a real engineering discipline. Small teams should not build this from scratch.
4. Variable traffic
Per-token billing scales perfectly with usage. Self-hosting requires GPUs running 24/7 — you pay even at 3 AM.
5. Quality matters more than cost (small to medium volume)
GPT-4o beats Llama 3.1 70B on most general benchmarks. For tasks where quality is the primary concern at modest volume, proprietary wins.
When open-source clearly wins
1. Privacy / data residency requirements
If your data legally cannot leave your network (defense, healthcare, finance in some jurisdictions), self-hosting is the answer. Even with Azure OpenAI in your tenant, some compliance teams demand "no third-party model at all".
2. Cost crossover at high volume
At ~1-5M tokens/day depending on the model, self-hosting becomes cheaper than per-token API billing.
Rough crossover math:
- GPT-4o-mini at $0.15/M input + $0.60/M output: ~$1k/month at 10M tokens/day
- Llama 3.1 70B on 2× A100 GPUs: ~$2-3k/month fixed; pays back above ~30M tokens/day
For most apps, proprietary models stay cheaper because you don't have constant traffic.
3. You need fine-tuning beyond what hosted APIs offer
Hosted fine-tuning is limited (small dataset sizes, expensive, less control). Open-source lets you do full SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), or RLHF on your own corpus.
Use cases: domain-specific style (legal, medical), proprietary terminology, classification tasks where a small fine-tuned model beats a generic large one.
4. Latency-critical (sub-100ms)
Network round-trip to a hosted API is 50-200ms. Locally hosted small models can respond in 30ms. For real-time / streaming use cases where latency is critical, local is better.
5. You have ML / platform engineers already
If you already have a platform team that operates Kubernetes + GPUs + observability, the marginal cost of adding LLM inference is low. If you'd be building it from scratch — don't.
The major open-source players (May 2026)
| Model | Sizes | Best for |
|---|---|---|
| Llama 3.1 | 8B / 70B / 405B | General-purpose; the broadest ecosystem |
| Mistral 7B / Mixtral 8x7B / Mistral Large 2 | 7B-123B | Strong all-around; Mixture-of-Experts efficient |
| Qwen 2.5 | 0.5B-72B | Strong in code + multilingual (Mandarin) |
| Phi-3 / Phi-3.5 | 3.8B / 14B | Small models punching above weight |
| Gemma 2 | 9B / 27B | Google's open release |
| DeepSeek V2 / V3 | 16B / 236B (MoE) | Cost-efficient at inference |
| Command R+ | 35B / 104B | Cohere; strong on RAG / agentic |
Inference frameworks for self-hosting
| Tool | Use case |
|---|---|
| vLLM | High-throughput batched serving; PagedAttention for memory efficiency |
| TGI (HuggingFace Text Generation Inference) | Solid alternative to vLLM |
| llama.cpp | CPU + low-end GPU; quantized models (GGUF) |
| Ollama | Easiest local dev; wraps llama.cpp |
| TensorRT-LLM | NVIDIA's optimized engine; fastest on H100s |
| MLC LLM | Mobile / edge inference (iOS, Android) |
For production self-hosting at scale: vLLM is the default choice for GPU serving.
Hybrid is normal
Most serious production systems use BOTH:
- Hosted model (GPT-4o) for high-quality user-facing queries
- Smaller fine-tuned local model for high-volume backend tasks (classification, extraction, retrieval scoring)
- Cascade pattern — try cheap small model first, fall back to expensive large model only if needed
def answer(question: str) -> str:
# Try cheap local first
small_answer = local_small_model.generate(question)
if local_small_model.confidence(small_answer) > 0.85:
return small_answer
# Fall back to GPT-4o
return gpt4o.generate(question)
Cost crossover example
Concrete scenario: customer-support chatbot at 100k queries/day, average 2k tokens in + 500 tokens out per query.
| Approach | Monthly cost |
|---|---|
| GPT-4o | ~$45,000 |
| GPT-4o-mini | ~$1,800 |
| Self-hosted Llama 3.1 70B on 4× A100 | ~$8,000 (fixed) |
| Self-hosted Llama 3.1 8B on 1× A10 | ~$1,500 (fixed) |
For 100k/day, GPT-4o-mini wins on cost AND requires zero ops. Llama 70B becomes interesting only at 5-10× this volume, AND when GPT-4o-mini quality is insufficient.
Common interview traps
"Should we use open-source for privacy?"
Maybe — but Azure OpenAI is in your Azure tenant, data never trains third-party models, region-pinned, audited. For 90% of "privacy" concerns, Azure OpenAI satisfies the requirement without the operational burden of self-hosting. Open-source self-hosting is needed only when "no third-party model at all" is required.
"GPT-4o is expensive — should we switch to Llama?"
Run the numbers. Often switching to GPT-4o-mini (10-30x cheaper than GPT-4o) is the right answer, not Llama. The cost-vs-quality curve has GPT-4o-mini in a sweet spot for most applications.
"We need fine-tuning — must we go open-source?"
Not necessarily. OpenAI, Together AI, Anyscale, Replicate, and others offer hosted fine-tuning. Open-source self-fine-tune wins only when you need DPO/RLHF beyond simple SFT, or when you need full control over the training process.
Interview-grade summary
"Default to hosted proprietary models (GPT-4o-mini, Azure OpenAI, Claude) for ~90% of use cases — better quality, zero ops, latest features, scales perfectly with traffic. Move to open-source (Llama 3.1, Mistral) when you have a specific reason: hard privacy requirements that even Azure OpenAI can't satisfy, cost crossover at very high volume (millions of tokens daily), or fine-tuning needs beyond what hosted APIs support. Most teams self-host too early — running production GPU inference well is its own engineering discipline. Start simple, prove value with hosted models, switch to local only when the math demands it."