Here’s the question I get asked constantly: “Should I self-host Llama or just use Claude API?” The people asking have usually done back-of-napkin math and think they’re about to save a fortune. Sometimes they’re right. Often they’ve missed half the costs. Let me show you how to actually calculate self-hosting LLM cost against paying for a managed API — with real numbers, real hardware, and the failure modes that will surprise you at 2am on a Tuesday.
This isn’t a philosophical debate about open-source versus proprietary. It’s a financial and engineering tradeoff that depends entirely on your workload shape — request volume, latency requirements, context window size, and how much infrastructure you actually want to own. Let’s get into it.
What You’re Actually Paying for With Claude API
Anthropic charges per token, in and out, at different rates depending on the model. At time of writing, Claude Haiku 3.5 runs at roughly $0.80 per million input tokens and $4 per million output tokens. Claude Sonnet 3.5 is approximately $3 per million input / $15 per million output. Claude Opus is significantly more expensive, and most teams reach for it only occasionally.
For a realistic workload — say, a customer support agent processing 10,000 requests per day, averaging 500 input tokens and 200 output tokens per request — the daily math looks like this:
# Daily cost estimate for Claude Haiku 3.5
requests_per_day = 10_000
avg_input_tokens = 500
avg_output_tokens = 200
input_cost_per_million = 0.80
output_cost_per_million = 4.00
daily_input_tokens = requests_per_day * avg_input_tokens # 5,000,000
daily_output_tokens = requests_per_day * avg_output_tokens # 2,000,000
daily_cost = (
    (daily_input_tokens / 1_000_000) * input_cost_per_million +
    (daily_output_tokens / 1_000_000) * output_cost_per_million
)
print(f"Daily cost: ${daily_cost:.2f}") # $4.00 + $8.00 = $12.00
print(f"Monthly cost: ${daily_cost * 30:.2f}") # ~$360/month
$360/month for 300,000 requests is genuinely cheap when you factor in zero infrastructure, zero scaling headaches, and access to a model that performs exceptionally well out of the box. The per-token pricing feels scary until you run the actual numbers.
The Hidden Claude API Advantages Nobody Puts in the Spreadsheet
No cold starts. No GPU driver issues. No CUDA version hell. No waking up to an OOM-killed inference server. Anthropic handles rate limiting, redundancy, and model updates. You also get context windows up to 200k tokens without needing to think about memory pressure. If your use case involves long documents or complex agent loops, that matters enormously and is very difficult to replicate cheaply on self-hosted hardware.
The Real Cost of Self-Hosting: Hardware First
Self-hosting LLM cost breaks down into three buckets: compute, storage/networking, and engineering time. Most people only count the first one.
GPU Options and What They Actually Cost
Running a 7B parameter model like Mistral-7B or Llama-3.1-8B in 4-bit quantization requires roughly 5–6GB of VRAM. An NVIDIA RTX 4090 (24GB VRAM) can comfortably run a 7B model and handle a moderate request queue. Here’s what that looks like across deployment options:
- Renting an A100 (80GB) on Lambda Labs: ~$1.99/hour = ~$1,433/month if you run it 24/7. Handles 70B models comfortably.
- Renting an RTX 4090 node on vast.ai: ~$0.35–0.55/hour = ~$252–396/month. Fine for 7B–13B models.
- Bare metal A100 on CoreWeave or similar: ~$2.00–2.50/hour depending on contract. Better for sustained high-throughput.
- Buying an RTX 4090 outright: ~$1,800–2,000 USD. Breaks even versus renting at ~$0.45/hour in roughly 185 days of continuous use. That assumes no power costs, cooling, rack space, or your time.
For a 70B model like Llama-3.1-70B, you need at least two A100s or H100s for decent inference speed — now you’re looking at $3,500–5,000/month in cloud GPU costs just to keep the lights on. The 70B models are where the self-hosting math gets very uncomfortable very fast.
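The buy-versus-rent breakeven above is easy to sanity-check yourself. A minimal sketch, using the article's estimates (and, as noted, ignoring power, cooling, rack space, and your time):

```python
# Rent-vs-buy breakeven for a GPU: days of continuous rental at which
# the rental bill equals the purchase price. Figures are the article's
# estimates, not vendor quotes.

def breakeven_days(purchase_price: float, rental_rate_per_hour: float) -> float:
    """Days of 24/7 rental at which buying the card costs the same."""
    return purchase_price / (rental_rate_per_hour * 24)

# RTX 4090: ~$2,000 to buy vs ~$0.45/hour to rent
days = breakeven_days(2_000, 0.45)
print(f"Breakeven after ~{days:.0f} days of continuous use")  # ~185 days
```

Swap in your actual rental rate and purchase price; at vast.ai's lower end (~$0.35/hour) the breakeven stretches out considerably.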
Throughput Comparison: Tokens Per Second at Real Workloads
Raw speed matters if you have latency SLAs or high concurrency. Here’s what you can realistically expect from a single-GPU setup versus Claude API:
- Mistral-7B on RTX 4090 (vLLM, 4-bit): 80–120 tokens/sec for single requests; throughput scales with batching but latency increases
- Llama-3.1-8B on RTX 4090 (vLLM, fp16): 60–90 tokens/sec, roughly similar story
- Qwen2.5-7B on RTX 4090: Similar ballpark — ~70–100 tokens/sec
- Claude Haiku 3.5 via API: Typically 100–180 tokens/sec; time-to-first-token is fast, but you’re sharing infrastructure
- Claude Sonnet 3.5 via API: Slower than Haiku, but still competitive with single-GPU self-hosted inference
The self-hosted numbers look comparable until you factor in concurrency. A single RTX 4090 can’t efficiently serve 50 simultaneous users without significant latency spikes. The API scales elastically. You’d need multiple GPUs and a proper load balancer to match that, which multiplies your cost significantly.
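To see why concurrency hurts, a crude back-of-envelope helps. This sketch assumes a fixed aggregate batched throughput split evenly across concurrent streams — a simplification (real schedulers like vLLM's are smarter), and the 1,200 tok/s aggregate figure is an illustrative assumption, not a benchmark:

```python
# Back-of-envelope: per-user generation time under continuous batching,
# assuming aggregate throughput is shared evenly across streams.

def seconds_per_response(aggregate_tok_per_s: float,
                         concurrent_users: int,
                         output_tokens: int) -> float:
    per_stream = aggregate_tok_per_s / concurrent_users
    return output_tokens / per_stream

# Illustrative: a 4090 batching a 7B model to ~1,200 tok/s aggregate,
# 50 simultaneous users, 200-token replies:
print(f"{seconds_per_response(1_200, 50, 200):.1f}s per response")  # ~8.3s
```

At 50 users, each stream effectively sees ~24 tokens/sec — a long way from the 80–120 tokens/sec a single request enjoys.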
Setting Up vLLM for Inference (The Code You Actually Need)
# Install: pip install vllm
from vllm import LLM, SamplingParams
# Load Mistral-7B on a single GPU. To run 4-bit, point `model` at an
# AWQ-quantized checkpoint and pass quantization="awq" — vLLM loads
# pre-quantized weights; it does not quantize base fp16 weights on the fly.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    dtype="auto",
    max_model_len=8192,           # Adjust based on your context needs
    gpu_memory_utilization=0.90,  # Leave some headroom for KV cache
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512,
)
prompts = [
    "Summarise the following support ticket: [ticket text here]",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
vLLM is the right choice here over llama.cpp for production inference — it handles continuous batching properly and has a built-in OpenAI-compatible API server you can drop in with minimal changes. Run it as a server with `python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3` and point your existing OpenAI SDK calls at localhost.
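A minimal client-side sketch, using only the standard library so there's nothing extra to install. It assumes the server above is listening on localhost:8000 (vLLM's default port) — adjust the URL if you changed it:

```python
# Building a request against the vLLM OpenAI-compatible server with the
# stdlib. The endpoint and payload shape follow the OpenAI chat format.
import json
import urllib.request

def chat_request(prompt: str,
                 model: str = "mistralai/Mistral-7B-Instruct-v0.3") -> urllib.request.Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# resp = urllib.request.urlopen(chat_request("Summarise this ticket: ..."))
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

If you'd rather keep the OpenAI SDK, the same endpoint works by setting the client's base URL to http://localhost:8000/v1 with any placeholder API key.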
Where Open-Source Models Actually Fall Short
Model quality is the variable that never shows up in the cost spreadsheet but determines whether any of this is worth doing. Being honest about this is important.
Instruction following: Llama-3.1-8B and Mistral-7B are genuinely good for constrained tasks — classification, extraction, summarisation with a fixed schema. They struggle with complex multi-step reasoning, nuanced instruction following, and anything requiring judgment calls. Claude Sonnet and Opus are materially better at these tasks, and pretending otherwise will burn you in production.
Context length: Most 7B models top out at 8k–32k tokens without quality degradation. Claude handles 200k tokens reliably. If your agent workflows involve long document analysis or extended conversation history, this alone might close the debate.
Safety and consistency: Open-source models require more prompt engineering and testing to behave consistently, especially at edge cases. Claude’s built-in alignment means fewer surprises in customer-facing applications. This has a real engineering cost if you’re trying to match behavior quality.
Qwen2.5 is the dark horse here. Alibaba’s Qwen2.5-72B consistently outperforms Llama-3.1-70B on coding benchmarks and several reasoning tasks. If you’re going to run a 70B model, Qwen2.5-72B is worth testing seriously. The 7B variant is also surprisingly capable for its size on multilingual tasks.
The Breakeven Calculation: When Does Self-Hosting Win?
Let’s build a simple breakeven model. Using an RTX 4090 rental at $0.50/hour ($360/month) running Mistral-7B:
# Breakeven analysis: self-hosted (single RTX 4090) vs Claude Haiku
gpu_monthly_cost = 360 # $0.50/hr * 720 hours
# Assume the self-hosted model handles your workload
# At what monthly API spend does self-hosting break even?
# Factor in engineering overhead (conservative estimate)
engineering_hours_per_month = 5 # monitoring, updates, incident response
engineer_hourly_rate = 100 # USD, conservative
engineering_cost = engineering_hours_per_month * engineer_hourly_rate # $500
total_self_hosted_monthly = gpu_monthly_cost + engineering_cost # $860
# So you need to be spending MORE than $860/month on Claude API
# for self-hosting a 7B model to be financially justified
# At $12/day on Haiku (our earlier example), that's ~$360/month
# Self-hosting at $860/month would be MORE expensive in this case
print(f"Self-hosted monthly (GPU + eng time): ${total_self_hosted_monthly}")
print(f"Claude Haiku monthly (300k requests): $360")
print(f"Self-hosting wins if Haiku costs exceed: ${total_self_hosted_monthly}/month")
# You'd need roughly 720,000+ requests/month on Haiku before self-hosting
# a single 4090 starts to make financial sense — assuming comparable quality
The breakeven point for a 7B model on a single consumer-grade GPU is roughly $800–1,000/month in API spend. Below that, you’re probably paying more to self-host when you honestly account for engineering time. Above $1,500–2,000/month, self-hosting starts to look compelling — if the model quality meets your needs.
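The worked example above generalises into a reusable function, so you can plug in your own workload shape and rates (the defaults below are the article's example numbers, not universal figures):

```python
# Generalised breakeven: the monthly request volume at which per-token
# API spend equals a fixed self-hosted monthly cost.

def breakeven_requests_per_month(self_hosted_monthly: float,
                                 input_tokens: int, output_tokens: int,
                                 in_rate: float, out_rate: float) -> float:
    """Requests/month where API spend equals self-hosted monthly cost.
    Rates are dollars per million tokens."""
    cost_per_request = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return self_hosted_monthly / cost_per_request

# $860/month self-hosted vs Haiku at $0.80/$4.00 per million tokens,
# 500 input / 200 output tokens per request:
n = breakeven_requests_per_month(860, 500, 200, 0.80, 4.00)
print(f"Breakeven at ~{n:,.0f} requests/month")  # ~716,667
```

That ~717k figure is where the "roughly 720,000+ requests/month" number earlier comes from; rerun it with Sonnet's rates and the breakeven drops dramatically.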
Data Privacy and Compliance: The Non-Cost Reason to Self-Host
There’s one argument for self-hosting that has nothing to do with cost: data residency and compliance. If you’re processing medical records, financial data, or anything under GDPR/HIPAA that you’re not comfortable sending to a third-party API, self-hosting is sometimes the only option regardless of economics. This is where self-hosting 7B models makes sense even at lower volumes — not because it’s cheaper, but because it’s the only path that satisfies your legal team.
Anthropic does offer enterprise agreements with data handling commitments, but that’s a different conversation from the standard API.
When to Use What: Clear Recommendations
Stick With Claude API If:
- Your monthly API bill is under $1,000
- You need 100k+ token context windows
- Quality and consistency matter more than cost
- You’re a solo founder or small team without dedicated DevOps capacity
- Your workload is spiky or unpredictable — you don’t want idle GPU hours
Self-Hosting Makes Sense If:
- Your API spend exceeds $1,500–2,000/month and the task works with 7B–13B model quality
- You have strict data privacy requirements that preclude third-party APIs
- You have sustained, predictable high-volume inference (not bursty)
- You have engineering capacity to maintain the stack — this is not a set-and-forget situation
- You’re running Qwen2.5-72B or Llama-3.1-70B for tasks where 70B quality is genuinely required and volume justifies the hardware cost
The Hybrid Approach (Often Best in Practice)
Many production systems I’ve seen run a 7B self-hosted model for high-volume, low-complexity tasks (classification, routing, extraction) and route complex reasoning tasks to Claude Sonnet or Opus. This keeps self-hosting LLM cost manageable while preserving quality where it counts. Use a router layer that classifies request complexity and sends it to the right model — the engineering investment pays for itself quickly at moderate scale.
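A hypothetical sketch of that router layer. The task names, thresholds, and model labels here are illustrative placeholders — in practice you'd tune the heuristics on your own traffic, or replace them with a small classifier model:

```python
# Hypothetical complexity router for the hybrid setup: cheap self-hosted
# model for constrained high-volume tasks, Claude for complex reasoning.

SIMPLE_TASKS = {"classify", "route", "extract"}  # placeholder task names

def pick_model(task: str, prompt: str) -> str:
    # Short prompts on constrained tasks go to the cheap local model;
    # everything else (long context, open-ended reasoning) goes to Claude.
    if task in SIMPLE_TASKS and len(prompt) < 4_000:
        return "self-hosted-7b"   # high-volume, low-complexity path
    return "claude-sonnet"        # quality-sensitive path

print(pick_model("classify", "Which team should handle this ticket?"))
# → self-hosted-7b
print(pick_model("analyze", "Walk through this 40-page contract..."))
# → claude-sonnet
```

Even a crude rule like this can move the bulk of request volume onto the cheap path; log the routing decisions so you can audit when the heuristic sends hard requests to the wrong tier.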
The honest answer is that self-hosting makes financial sense later than most people think, for fewer use cases than the open-source evangelists suggest — but when it does make sense, it makes sense clearly. Run your actual numbers, include engineering time, and benchmark on your real task before committing. The GPU rental market is competitive enough that you can prototype cheaply on vast.ai before signing any contracts.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes.

