Sunday, April 5

You’ve got a production agent that’s randomly returning garbage, your costs spiked 3x overnight, and you have no idea which step in the chain is responsible. That’s exactly the moment you realize you need a proper LLM observability platform. This LLM observability platform comparison covers the three tools most teams reach for first — Helicone, LangSmith, and Langfuse — with real assessments of their debugging UIs, data retention policies, alerting, and what breaks when you push them hard.

All three have free tiers and are production-ready, but they’re built for different workflows. I’ll give you a definitive pick at the end rather than the usual cop-out of “it depends on your use case.”

Why LLM Observability Actually Matters in Production

Most teams instrument their agents late — after something breaks in production. By then you’re already flying blind through logs. The core problem with LLM observability isn’t just capturing token counts; it’s correlating inputs, outputs, tool calls, latency, and cost across a multi-step agent run where a failure three steps in can be triggered by a bad prompt two steps back.

If you’re building anything beyond a single-prompt chatbot — multi-agent pipelines, RAG retrieval chains, autonomous workflows — you need span-level tracing, not just request logging. The difference matters enormously when you’re debugging. If you’re already building those systems and haven’t thought about observability for production Claude agents, read that first for the foundational concepts before diving into tooling.

Here’s what a proper trace integration looks like with any of these tools — the structure is consistent even if the SDK calls differ:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# All three platforms use a similar decorator/wrapper pattern.
# This example shows the conceptual structure: each SDK wraps your LLM calls
# and captures input, output, latency, token count, model, and metadata.
# `tracer`, `retrieve_relevant_docs`, and `query` are placeholders, not a real SDK.

with tracer.start_span("agent_run", metadata={"user_id": "u123", "session": "s456"}) as span:
    # Step 1: retrieval
    with tracer.start_span("rag_retrieval", parent=span) as retrieval_span:
        docs = retrieve_relevant_docs(query)
        retrieval_span.set_attribute("doc_count", len(docs))

    # Step 2: LLM call
    with tracer.start_span("llm_completion", parent=span) as llm_span:
        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": f"{docs}\n\n{query}"}]
        )
        llm_span.set_attribute("tokens_used", response.usage.input_tokens + response.usage.output_tokens)

Helicone: Best Proxy-Based Integration

Helicone works as a proxy between your app and the LLM provider. You swap your base URL, and every request flows through their infrastructure. No SDK changes, no decorator wrapping — it’s the fastest integration path of the three.

import anthropic

# One-line integration — change the base URL, everything else is captured automatically
client = anthropic.Anthropic(
    api_key="your-anthropic-key",
    base_url="https://anthropic.helicone.ai",
    default_headers={
        "Helicone-Auth": "Bearer your-helicone-key",
        "Helicone-Property-App": "lead-agent",  # custom metadata
    }
)
# All your existing code works unchanged — Helicone captures everything

Helicone Pricing and Data Retention

The free tier gives you 10,000 requests/month. After that, you’re paying $0.0002 per request on the pay-as-you-go plan — so 1M requests/month costs $200. Enterprise plans start around $200/month flat for higher volumes. Data retention on the free tier is 30 days; paid plans extend to 365 days.

Where Helicone Wins and Where It Falls Short

The dashboard is genuinely excellent for cost visibility. You can slice spend by custom properties — by user, by agent type, by workflow — within minutes of setup. The latency percentile charts are production-quality. Alerting exists but is basic: threshold alerts on cost and error rate, delivered via email or webhook. No anomaly detection, no ML-based drift alerts.

The proxy approach is also its main weakness: it adds ~20-50ms of latency per call (they publish 99th percentile numbers on their status page), and if Helicone goes down, your LLM calls fail unless you’ve built fallback logic. For high-stakes synchronous agents, that’s a real operational dependency. The tracing depth is also shallower — Helicone captures request/response pairs well but doesn’t natively support multi-step span hierarchies the way LangSmith or Langfuse do. It’s better suited for single-turn monitoring than deep agent tracing.
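The fallback logic mentioned above can be sketched provider-agnostically. `with_fallback`, `proxied_call`, and `direct_call` are illustrative names, not part of any SDK; in production the two callables would wrap an Anthropic client pointed at the Helicone proxy and one pointed at the direct endpoint.

```python
# Minimal sketch of proxy fallback: try the Helicone-proxied call first,
# fall back to a direct client on connection-level failures. Names here
# are illustrative placeholders, not part of Helicone or Anthropic SDKs.

def with_fallback(primary, fallback, exceptions=(ConnectionError, TimeoutError)):
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except exceptions:
            # Proxy outage: you lose observability for this call,
            # but you keep serving traffic.
            return fallback(*args, **kwargs)
    return call

# Stubs standing in for the proxied / direct Anthropic clients:
def proxied_call(prompt):
    raise ConnectionError("helicone proxy unreachable")

def direct_call(prompt):
    return f"direct: {prompt}"

complete = with_fallback(proxied_call, direct_call)
print(complete("hello"))  # served by the direct client after the proxy fails
```

The tradeoff is that fallback calls bypass the proxy entirely, so they never show up in Helicone's dashboard; log them separately if you need a complete record.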

Ideal for: teams already running well-defined pipelines who need immediate cost visibility with minimal code changes.

LangSmith: Best for LangChain-Native Teams

LangSmith is LangChain’s official observability product, and if you’re using LangChain or LangGraph, the integration is literally a few environment variables:

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-key"
os.environ["LANGCHAIN_PROJECT"] = "production-agent-v2"

# That's it — all LangChain/LangGraph calls are now traced automatically
# For non-LangChain code, use the decorator approach:
from langsmith import traceable

@traceable(name="custom_tool_call", tags=["retrieval"])
def retrieve_docs(query: str) -> list:
    docs = []  # your retrieval logic goes here
    return docs

LangSmith Pricing and Data Retention

Developer tier is free: 5,000 traces/month, 14-day retention. Plus tier is $39/month per seat: 50,000 traces included, then $0.005 per 1,000 additional traces. Data retention extends to 400 days on paid plans. The trace volume pricing sounds cheap until you’re running a high-frequency agent — 10 LLM calls per trace means 50K traces covers only 500K LLM calls before you start paying more.

LangSmith’s Debugging UI is Its Real Advantage

This is where LangSmith genuinely pulls ahead. The trace waterfall view is detailed: you can see every chain step, every tool call, every retrieval result, the exact prompt that was sent (after template rendering), and the model’s full response. For debugging a multi-hop agent where step 4 is hallucinating because step 2 retrieved the wrong context, this is invaluable. You can replay individual traces with modified inputs — critical for regression testing.

The evaluation framework is also baked in. You can create datasets from production traces and run automated evaluations against them. If you’re building anything like an LLM output quality evaluation pipeline, LangSmith’s human annotation and LLM-as-judge tooling integrates natively with your trace data.

The alerting story is weak. LangSmith has no native alert system as of early 2025 — you have to poll their API or export to an external system. For production monitoring where you need to be paged at 2am when your agent’s error rate spikes, this is a meaningful gap. Also, if you’re not on LangChain, the manual instrumentation is more verbose than Langfuse’s SDK.
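If you do build that polling layer, the core of it is just an error-rate threshold over a recent window. The sketch below shows only that logic; the run dicts stand in for whatever shape your poll returns (the LangSmith SDK exposes a `list_runs` client method you could feed this with, but verify the current API before relying on it), and wiring the result to PagerDuty or Slack is left to your webhook of choice.

```python
# Homegrown alert check: compute an error rate over recently polled runs
# and decide whether to fire a webhook. Run dicts are a stand-in for
# whatever your LangSmith API poll returns.

def error_rate(runs):
    """runs: iterable of dicts with a boolean 'error' field."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(1 for r in runs if r.get("error")) / len(runs)

def should_alert(runs, threshold=0.05):
    # Fire when the windowed error rate meets or exceeds the threshold
    return error_rate(runs) >= threshold

recent = [{"error": False}, {"error": True}, {"error": False}, {"error": False}]
print(should_alert(recent, threshold=0.2))  # 25% error rate crosses a 20% threshold
```

Run this on a cron or a background task every few minutes; the window size and threshold are the knobs you'll actually tune.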

Ideal for: LangChain/LangGraph users who need deep trace inspection and evaluation tooling and are okay building their own alerting layer.

Langfuse: Best Open-Source Option for Self-Hosting

Langfuse is the only one of the three with a fully open-source, self-hostable option. The cloud version is comparable to the others, but the self-hosted path (Docker Compose or Kubernetes) is genuinely production-ready and actively maintained.

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"  # or your self-hosted URL
)

@observe()  # automatically creates a trace
def run_agent(user_input: str) -> str:
    # nested @observe() decorators create child spans automatically
    context = retrieve_context(user_input)

    langfuse_context.update_current_observation(
        input=user_input,
        metadata={"retrieved_chunks": len(context)}
    )

    response = call_llm(context, user_input)
    return response

@observe(name="llm_call", capture_input=True, capture_output=True)
def call_llm(context: str, query: str) -> str:
    # Langfuse captures cost automatically if you pass model + usage;
    # `client` is an anthropic.Anthropic instance defined elsewhere
    return client.messages.create(model="claude-3-5-sonnet-20241022", ...)

Langfuse Pricing and Data Retention

Cloud free tier: 50,000 observations/month (an “observation” is any span, so a 10-step trace uses 10 observations). Pro is $59/month for 1M observations, then $10 per additional 100K. Self-hosted is fully free — your costs are just infrastructure (a $10/month Fly.io or Railway instance handles modest loads). Data retention on cloud: 30 days free, 90 days Pro, unlimited self-hosted.

Langfuse’s Alerting and Dashboard

Langfuse’s alerting is more mature than LangSmith’s but less polished than what you’d get from a dedicated monitoring stack. You can configure score-based alerts (trigger when LLM-as-judge evaluation drops below threshold) and webhook notifications. The dashboard has configurable metric charts — latency, cost, error rate, custom scores — and supports SQL-level data exports.

The UI is slightly less polished than LangSmith’s trace waterfall, but it handles high observation volumes better. At 100K+ traces, LangSmith’s UI can feel sluggish; Langfuse stays snappy. The prompt management feature — version control for your prompts, fetched at runtime — is a genuinely useful production feature the others lack. You can roll back a bad prompt without a code deployment.
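To show why runtime-fetched prompts enable rollback without a deploy, here is the pattern sketched without the SDK. With the real SDK the calls are roughly `langfuse.get_prompt(name)` and `prompt.compile(**vars)`, but treat those names as assumptions and check the current docs; the store contents below are invented.

```python
# Runtime prompt fetching: the app asks the prompt store for the current
# production version at call time, with a pinned local copy as a safety
# net. Rolling back means re-labeling a version server-side; no deploy.
# The dict-based "remote store" is illustrative, not the Langfuse API.

PINNED_FALLBACK = {
    "support-agent-system": "You are a support agent for {product_name}."
}

def fetch_prompt(name, remote_store, fallback=PINNED_FALLBACK):
    try:
        return remote_store[name]  # stands in for the network fetch
    except KeyError:
        return fallback[name]      # pinned copy shipped with the code

def compile_prompt(template, **variables):
    return template.format(**variables)

remote = {"support-agent-system": "You are a cheerful support agent for {product_name}."}
tmpl = fetch_prompt("support-agent-system", remote)
print(compile_prompt(tmpl, product_name="Acme"))
```

The pinned fallback matters: if the prompt service is unreachable at startup, you want the last known-good template, not a crash.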

One rough edge: the Python SDK’s decorator-based approach can conflict with async frameworks in unexpected ways. I’ve hit issues with FastAPI + async LangChain where traces were dropped silently. Always test your instrumentation under load before treating it as reliable. If you’re pairing this with safety monitoring and drift detection for your agents, Langfuse’s custom scoring system can feed directly into those pipelines.
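One mitigation for the silent-drop issue, assuming the batching behavior described: SDKs like Langfuse buffer events in a background queue and expose a `flush()` method, and registering it at shutdown avoids losing whatever is still queued when an async worker exits. The buffer class below is illustrative, not the SDK's.

```python
import atexit

# Why traces get dropped: batched exporters hold spans in a queue and
# ship them in the background, so an abrupt async shutdown discards the
# queue. An explicit flush at exit (or in a FastAPI shutdown hook)
# drains it first. SpanBuffer is a stand-in for the real exporter.

class SpanBuffer:
    def __init__(self):
        self.pending, self.exported = [], []

    def record(self, span):
        self.pending.append(span)

    def flush(self):
        # Ship everything still queued, then clear the queue
        self.exported.extend(self.pending)
        self.pending.clear()

buffer = SpanBuffer()
atexit.register(buffer.flush)  # with the real SDK: atexit.register(langfuse.flush)

buffer.record({"name": "llm_call", "latency_ms": 420})
buffer.flush()
print(len(buffer.exported))  # the recorded span survived shutdown
```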

Ideal for: teams with data residency requirements, cost-sensitive builders who want to self-host, and anyone who needs prompt versioning as part of their production workflow.

Head-to-Head Comparison

| Feature | Helicone | LangSmith | Langfuse |
| --- | --- | --- | --- |
| Integration method | Proxy (base URL swap) | SDK / env vars | SDK decorators |
| Setup time | ~5 minutes | ~10 minutes (LangChain), ~30 min (custom) | ~15–20 minutes |
| Free tier volume/month | 10,000 requests | 5,000 traces | 50,000 observations |
| Free data retention | 30 days | 14 days | 30 days |
| Multi-step agent tracing | Shallow | Excellent | Excellent |
| Cost visibility | Excellent | Good | Good |
| Native alerting | Basic (threshold) | None (API only) | Webhook + score-based |
| Evaluation / LLM-as-judge | None | Excellent | Good |
| Prompt versioning | No | Yes (LangChain Hub) | Yes (built-in) |
| Self-hosting | No | Limited (enterprise) | Yes (fully open-source) |
| Paid entry price | $0.0002/req | $39/seat/month | $59/month |
| Proxy latency overhead | 20–50ms | None | None (async flush) |
| LangChain native | No | Yes | Partial |

Production Cost Reality Check

At 500K LLM calls/month across a production agent fleet — which is modest if you’re running anything like an automated email lead generation agent at scale — here’s what you’d actually pay:

  • Helicone: ~$100/month (500K × $0.0002). Plus the latency tax on every synchronous call.
  • LangSmith: Depends heavily on trace depth. If your average trace has 8 spans, 500K calls is ~62.5K traces. Plus tier ($39/seat) covers 50K, and at the listed rate of $0.005 per 1,000 additional traces, the ~12.5K overage adds only a few cents. Very cheap at this volume.
  • Langfuse Cloud: 500K calls at ~4 observations each = 2M observations. Pro tier ($59) includes 1M, and the extra 1M at $10 per 100K adds $100, so roughly $159/month total. Or self-host for ~$15/month on a small VPS.
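The unit conversions above are easy to get wrong, so here is a quick sanity-check using the list prices quoted in this post; treat the rates as snapshots that will drift.

```python
# Back-of-envelope model of observability spend at a given call volume.
# Rates are the list prices quoted in this post, passed as parameters so
# you can update them when vendors change pricing.

def helicone_cost(calls, usd_per_request=0.0002):
    # Helicone bills per proxied request
    return round(calls * usd_per_request, 2)

def langsmith_trace_count(calls, spans_per_trace):
    # LangSmith bills per trace, so deeper traces stretch the quota further
    return calls / spans_per_trace

def langfuse_observations(calls, obs_per_call=4):
    # Langfuse bills per observation: every span in a trace counts
    return calls * obs_per_call

print(helicone_cost(500_000))             # proxy cost at 500K calls
print(langsmith_trace_count(500_000, 8))  # traces consumed at 8 spans/trace
print(langfuse_observations(500_000))     # observations at ~4 per call
```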

For teams who are already tracking costs carefully — and if you’re not, the LLM cost management guide is worth reading first — Langfuse self-hosted wins on total cost of ownership at any meaningful volume.

Verdict: Which Platform to Choose

Choose Helicone if you have simple, mostly single-turn pipelines, you want zero code changes to your existing setup, and your primary concern is cost tracking rather than deep debugging. Also valid if you’re on a small team and time-to-value matters more than tracing depth.

Choose LangSmith if you’re deep in the LangChain/LangGraph ecosystem, your team is running regular evaluations and red-teaming sessions, and the per-seat pricing fits your team size. The debugging UI at trace level is the best of the three — if you spend time every week investigating “why did this agent say that,” LangSmith pays for itself.

Choose Langfuse if you have data residency requirements (GDPR, HIPAA), you want to self-host to keep costs low at scale, or you need prompt versioning as a first-class production feature. The open-source trajectory also means you’re not locked into a single vendor’s pricing decisions.

My default recommendation for most production teams: start with Langfuse Cloud on the free tier. It has the most generous free tier by volume, the SDK is framework-agnostic, self-hosting is a real option if your data needs it, and the feature set covers 90% of what teams actually need in production. If you hit the evaluation and LLM-as-judge workflow hard, layer in LangSmith later; it’s not an either/or decision, and many teams run both.

Frequently Asked Questions

Can I use Helicone, LangSmith, and Langfuse at the same time?

Yes, and many teams do. Helicone is proxy-based so it captures everything at the HTTP layer regardless of what SDK you’re using. You could run Helicone for cost visibility while using Langfuse or LangSmith for deeper trace instrumentation. The main downside is complexity — managing two observability configurations adds overhead. Start with one and add a second only if you have a clear gap it fills.

How do I self-host Langfuse in production?

Langfuse publishes an official Docker Compose file that sets up the app server and a Postgres database. For production use, run Postgres on a managed service (Supabase, Railway, Neon) rather than in a container, and deploy the app server behind a reverse proxy like Caddy or nginx. Their Kubernetes Helm chart is also maintained for higher-scale deployments. Budget ~$15–30/month on Railway or Fly.io for a setup that handles hundreds of thousands of observations.

Does LangSmith work with non-LangChain code like direct Anthropic or OpenAI API calls?

Yes, via the @traceable decorator or the RunTree API. The auto-instrumentation only fires for LangChain components, but you can manually wrap any function. It’s more verbose than Langfuse’s decorator pattern for non-LangChain code, but it works. Expect to spend 30–60 minutes instrumenting a medium-complexity custom agent.

What’s the latency impact of using Helicone’s proxy in production?

Helicone publishes a median overhead of around 10–20ms and P99 of 40–50ms. For streaming responses this is less noticeable, but for synchronous, latency-sensitive applications it’s a real consideration. They have edge nodes in multiple regions, so routing to the nearest one matters. If your agent has strict SLA requirements under 100ms end-to-end, test the proxy overhead specifically in your deployment region before committing.

Which platform has the best alerting for production agents?

None of them are great out of the box. Helicone has threshold alerts via email and webhook. Langfuse has the most flexible system — webhook triggers on score thresholds and error rates — which lets you pipe alerts into PagerDuty or Slack. LangSmith has no native alerting as of early 2025; you’d need to poll their API and build your own. For serious production monitoring, most teams export metrics to Grafana or Datadog alongside whichever LLM-specific tool they choose.

How does LLM observability differ from regular application monitoring?

Standard APM tools (Datadog, New Relic) capture latency, error rate, and request volume but know nothing about what’s inside an LLM call. LLM observability adds prompt-level capture, token usage tracking, model output quality scoring, and trace hierarchies that follow reasoning chains across multiple model calls. The overlap is infrastructure-level metrics — you typically want both, not one or the other.


Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
