We analyzed 100K+ Langfuse traces. Here’s what’s hiding in production.
Model version drift: the quiet tax doubling your OpenAI bill
We’ve been helping folks with a variety of production Langfuse projects: voice AI, agentic tooling, analytics platforms, healthcare pipelines, and data extraction services. Running over 100,000 traces, representing thousands of dollars in monthly LLM spend, through our automated analysis pipeline revealed findings far worse than we expected.
The problem wasn’t that any single project was performing an unusual task, but that the same patterns of inefficiency were showing up everywhere. These issues were visible in the traces the entire time, yet they were sitting on dashboards that nobody was looking at closely enough.
Most projects have a caching problem
Connect your Langfuse and get a PR that saves you money with Jetty.
We discovered one extraction service that sent the exact same input to an LLM over 70% of the time. The same file, the same content, and the same result were repeatedly processed with no caching or deduplication in place. Each redundant call cost a few cents and took about a minute, when a simple cache lookup could have returned the result instantly.
Worse still, these redundant calls led to cascading failures. Burst traffic would overwhelm the upstream API and trigger timeout errors, with every single failure occurring on an input that had already been processed successfully dozens of times. The fix is a straightforward hash-based response cache, often around 30 lines of code.
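A minimal sketch of such a cache, assuming `call_llm` stands in for your existing model call (both function names here are ours, not from any project we analyzed):

```python
import hashlib
import json

# In-memory store; swap for Redis or a database table in production.
_cache: dict[str, str] = {}

def cache_key(model: str, prompt: str) -> str:
    # Stable hash over the exact model + input.
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_llm_call(model: str, prompt: str, call_llm) -> str:
    key = cache_key(model, prompt)
    if key in _cache:
        return _cache[key]  # cache hit: no API call, no latency, no cost
    result = call_llm(model, prompt)
    _cache[key] = result
    return result
```

Identical inputs return instantly from the cache instead of re-invoking the model, which also removes the redundant calls that were triggering the timeout cascades.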
What to check: Are your LLMs repeatedly processing the same inputs? Implementing even simple hash-based deduplication can dramatically cut both spend and failure rates.
System prompts are a hidden cost multiplier
For those using conversational AI frameworks like VAPI, it is easy to overspend without even realizing it. In multiple projects, we observed that every turn included the full system prompt, thousands of characters of business context and behavioral instructions, repeated identically each time. In a 50-turn phone conversation, that prompt is sent 50 times. Consequently, over 90% of the token spend in these projects was dedicated to resending the system prompt. The most expensive single conversation we found, despite not being particularly long, cost $0.36.
Prompt caching, available from OpenAI since late 2024 and supported by most major providers, can cut input token costs by 60–70%. That is a meaningful monthly saving from a single configuration change.
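You can verify whether prompt caching is actually kicking in from the usage data alone. OpenAI reports cache hits in the chat completion `usage` object under `prompt_tokens_details.cached_tokens`; the sketch below assumes a dict of that shape (the function name is ours):

```python
def cached_input_share(usage: dict) -> float:
    """Fraction of prompt tokens served from the provider's prompt cache."""
    prompt = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0
```

If this stays near zero on a provider that supports prompt caching, you are paying full price to resend the same context on every turn.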
What to check: Examine your input-to-output token ratio. If input tokens dominate by 100:1 or more, you are likely paying to resend the same context with every turn.
You’re probably using the wrong model
This issue appeared in two distinct forms across nearly every project we analyzed.
Model version drift: Teams pin model versions for stability and then forget about them. We found projects simultaneously running half a dozen versions of GPT-4o, with the older versions costing twice as much as the newer ones for the same prompts and the same quality.
Overpowered models for simple tasks: One project was using a frontier model for intent classification, a task that averages only 70 output tokens. A mini model could handle this at a mere fraction of the cost. Another project spent over 80% of its budget on the most expensive available model for operations that had no need for its advanced capabilities.
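Both patterns fall out of a simple group-by over your trace export. A sketch, assuming each generation record carries `model` and `cost_usd` fields (the field names are our assumption, not a Langfuse schema):

```python
from collections import defaultdict

def cost_by_model(generations: list[dict]) -> dict[str, float]:
    """Total spend per exact model version string, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for g in generations:
        totals[g["model"]] += g["cost_usd"]
    # Sorting by spend makes stale pinned versions stand out immediately.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Scanning the resulting table for multiple date-stamped versions of the same base model is usually enough to spot drift.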
What to check: Group your traces by model version and compare the associated costs. Then look at your simplest operations, such as classification, routing, and formatting, and question whether they genuinely require your most capable model.
Agentic workflows compound costs fast
Agentic workflows that accumulate context over multiple turns generate compounding costs that are difficult to isolate at the dashboard level. As each step adds to the context window, the 20th call in a chain processes far more tokens than the first. We saw traces where individual generations were processing over 100K input tokens by the end of a session.
The single most expensive trace across everything we analyzed—a single workflow execution—cost over $50.
This behavior is inherent to how agentic architectures function. However, without per-trace cost visibility broken down by step, you have no clear way to determine which specific operations are driving the bill.
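One cheap way to get that visibility is to compare tail cost to median cost across traces. A sketch, assuming `costs` is a list of per-trace USD totals pulled from your export (nearest-rank p99, which is fine for a rough health check):

```python
import statistics

def tail_to_median_ratio(costs: list[float]) -> float:
    """How much more expensive the p99 trace is than the median trace."""
    ordered = sorted(costs)
    # Integer arithmetic avoids floating-point surprises at the index boundary.
    p99 = ordered[min(len(ordered) - 1, (99 * len(ordered)) // 100)]
    median = statistics.median(ordered)
    return p99 / median if median else float("inf")
```

A ratio near 1 means costs are uniform; a ratio of 10 or more usually means a few long chains are carrying most of the bill.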
What to check: Focus on your most expensive traces, not just your averages. If your p99 cost is ten times your median, context accumulation is the probable cause. Solutions include summarizing intermediate context or routing later steps to cheaper models.
Errors are hiding in your traces
One project logged a startling 134% error rate, meaning more errors than traces in a given month. These were not intermittent blips but systemic failures that had been running for weeks. Another pipeline was hitting 27-minute latencies on individual operations. The error counts were present in Langfuse the entire time, but no one had aggregated them by step in a way that made the severity obvious enough to prompt action.
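That aggregation is only a few lines over your exported observations. A sketch, assuming each record has `step` and `error` fields (our names, not a fixed schema):

```python
from collections import defaultdict

def error_rate_by_step(observations: list[dict]) -> dict[str, float]:
    """Error rate per pipeline step, instead of one blended overall number."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for o in observations:
        totals[o["step"]] += 1
        if o.get("error"):
            errors[o["step"]] += 1
    return {step: errors[step] / totals[step] for step in totals}
```

With ten observations split across two steps, a blended 20% rate can hide a step failing 40% of the time, which is exactly what the per-step view surfaces.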
What to check: Aggregate your error rates by pipeline step, rather than just looking at the overall rate. An aggregate rate of 5% might be concealing a single step that fails 40% of the time.
Some of your spend is probably invisible
Across the projects we analyzed, a significant portion of LLM calls showed a cost of $0 in Langfuse. This typically happens when calls are routed through Azure or other deployments that do not report usage data back. We even found one project with zero cost tracking altogether. You cannot optimize what you cannot see. If your Langfuse dashboard costs appear lower than your actual cloud bill, that discrepancy is not a saving—it is a blind spot.
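A first-pass check for this blind spot is the share of traced calls with no cost attached. A sketch, again assuming a `cost_usd` field on each exported generation record (our field name):

```python
def untracked_share(generations: list[dict]) -> float:
    """Fraction of traced generations reporting zero or missing cost."""
    if not generations:
        return 0.0
    zero_cost = sum(1 for g in generations if not g.get("cost_usd"))
    return zero_cost / len(generations)
```

If this is well above zero, reconcile against the provider invoice before trusting any dashboard total.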
What to check: Compare your Langfuse-reported costs directly against your actual invoices. If there is a meaningful gap, determine which providers are failing to report usage data.
The gap between observability and optimization
Every project we examined had observability in place; they could see their traces, latencies, and model usage. Yet, observability alone was insufficient to surface these problematic patterns. The redundant computation was visible in individual traces, but you had to look at all of them to spot the duplication. The model version tax was hidden in a column nobody thought to group by. The error rate was available, but not aggregated in a manner that compelled anyone to act.
The gap, therefore, is not in data collection—which Langfuse handles well—but in systematically analyzing that data. The difference is between merely seeing traces individually and understanding their collective meaning in aggregate. Through this analysis, we identified savings opportunities ranging from 30% to over 90% of current spend. Crucially, the fixes were not exotic: a response cache, a simple config change for prompt caching, pinning a model version, or routing simple tasks to smaller models. Most of these solutions required less than an hour of work.
If you are currently running LLM workloads through Langfuse and have not performed this type of recent analysis, you may be surprised by what is hidden in your traces.

