By the end of this tutorial, you’ll have a working Python toolkit that audits your prompts for token waste, applies compression techniques automatically, and measures the quality delta so you know exactly what you’re trading away. Developers running high-volume Claude workflows have cut prompt token costs by 40–60% with these techniques without touching output quality.
A quick reality check first: token optimization isn’t magic. It’s engineering. You’re making deliberate tradeoffs between verbosity and precision, and some of those tradeoffs will hurt performance if you’re not measuring. This tutorial gives you the measurement framework alongside the compression techniques.
What you’ll build — step overview
- Install dependencies and set up token counting — baseline measurement before you optimize anything
- Audit your existing prompts — find the actual waste (it’s usually not where you think)
- Apply structural compression — rewrite instructions to use fewer tokens without losing meaning
- Strip redundant context dynamically — remove conversation history that no longer contributes signal
- Validate quality with automated scoring — confirm you haven’t quietly broken your agent
- Wire it into a production wrapper — a drop-in optimizer class you can use immediately
Step 1: Install dependencies and set up token counting
You need anthropic, tiktoken (faster than calling the API just to count tokens), and jinja2 for prompt templating. Pin versions — this stuff breaks on minor updates.
```bash
pip install anthropic==0.34.2 tiktoken==0.7.0 jinja2==3.1.4 python-dotenv==1.0.1
```
Now build a token counter that works offline. Claude uses a tokenizer close to cl100k_base, which tiktoken ships. It’s not exact — expect ±2% variance — but it’s fast enough to run on every prompt before sending.
```python
import tiktoken
import anthropic
import os

# cl100k_base is the closest public tokenizer to Claude's internal one
encoder = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Fast offline token count. ~2% variance vs Anthropic's actual count."""
    return len(encoder.encode(text))

def count_message_tokens(messages: list[dict]) -> int:
    """Count tokens across a full messages array."""
    total = 0
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, str):
            total += count_tokens(content)
        elif isinstance(content, list):
            # Handle content blocks (tool results, images, etc.)
            for block in content:
                if block.get("type") == "text":
                    total += count_tokens(block.get("text", ""))
    return total

# Establish baseline before touching anything
def audit_prompt(system: str, messages: list[dict]) -> dict:
    system_tokens = count_tokens(system)
    message_tokens = count_message_tokens(messages)
    total = system_tokens + message_tokens
    # At Claude Sonnet 3.5 pricing: $3 per 1M input tokens
    estimated_cost = (total / 1_000_000) * 3.0
    return {
        "system_tokens": system_tokens,
        "message_tokens": message_tokens,
        "total_tokens": total,
        "estimated_cost_per_call": round(estimated_cost, 6),
        "estimated_cost_per_1k_calls": round(estimated_cost * 1000, 4),
    }
```
Run this on your actual prompts before doing anything else. Most engineers are surprised: system prompts are usually 40–70% of input tokens, and half of that is typically boilerplate that doesn’t need to be there on every call.
Step 2: Audit your existing prompts for token waste
The next pass categorizes that waste. There are four main offenders I see repeatedly in production systems.
```python
import re
from dataclasses import dataclass

@dataclass
class WasteReport:
    filler_phrases: int      # "Please", "I would like you to", "As an AI"
    redundant_examples: int  # Examples that repeat the same pattern
    verbose_formatting: int  # Markdown in prompts that adds tokens, not meaning
    stale_context: int       # Old conversation turns with zero relevance

def audit_for_waste(system_prompt: str, messages: list[dict]) -> WasteReport:
    # Filler phrases that add tokens without adding instruction clarity
    filler_patterns = [
        r"\bplease\b", r"\bkindly\b", r"\bi would like you to\b",
        r"\bas an ai\b", r"\bcertainly\b", r"\bof course\b",
        r"\bsure, i(?:'ll| will)\b", r"\bfeel free to\b",
        r"\bit(?:'s| is) important to note that\b",
        r"\byour role is to\b",  # When preceded by extensive role setup
    ]
    # Join with a space so the system prompt's last word doesn't fuse
    # with the first message and hide a match
    full_text = system_prompt + " " + " ".join(
        m.get("content", "") for m in messages if isinstance(m.get("content"), str)
    )
    filler_count = sum(
        len(re.findall(p, full_text, re.IGNORECASE))
        for p in filler_patterns
    )
    # Detect repeated example patterns (same structure, different values)
    example_blocks = re.findall(r'(?:example|e\.g\.|for instance)[:\s]+.{20,200}',
                                full_text, re.IGNORECASE)
    redundant_examples = max(0, len(example_blocks) - 2)  # Keep max 2 examples
    # Markdown in system prompts (### headers, **bold**, etc.) adds tokens
    markdown_tokens = len(re.findall(r'#{1,6}\s|[*_]{1,2}|\[.+?\]\(.+?\)', system_prompt))
    # Stale context: messages older than 6 turns (12 messages) are likely low-signal
    stale = max(0, len(messages) - 12)  # Rough heuristic
    return WasteReport(filler_count, redundant_examples, markdown_tokens, stale)

# Example output on a real prompt I've seen in production:
# WasteReport(filler_phrases=14, redundant_examples=3, verbose_formatting=22, stale_context=8)
# That's ~380 tokens of pure waste on a 1,200-token prompt
```
Step 3: Apply structural compression to instructions
This is where the real gains are. The goal isn’t truncation — it’s rewriting instructions to be denser. English is a verbose language; imperative bullet points are not.
```python
def compress_system_prompt(prompt: str) -> str:
    """
    Apply rule-based compression. Order matters — run substitutions first,
    then structure rewrites, then whitespace cleanup.
    """
    # 1. Strip filler phrases
    filler_substitutions = {
        r"please\s+": "",
        r"kindly\s+": "",
        r"i would like you to\s+": "",
        r"your task is to\s+": "",
        r"you are required to\s+": "",
        r"it is important that you\s+": "",
        r"make sure to\s+": "always ",
        r"always remember to\s+": "always ",
        r"feel free to\s+": "",
        r"note that\s+": "",
        r"please note:\s+": "",
    }
    result = prompt
    for pattern, replacement in filler_substitutions.items():
        result = re.sub(pattern, replacement, result, flags=re.IGNORECASE)

    # 2. Convert verbose instructions to imperative bullet style
    # "You should respond with JSON that includes..." → "Respond with JSON including..."
    result = re.sub(r"you should\s+", "", result, flags=re.IGNORECASE)
    result = re.sub(r"you must\s+", "", result, flags=re.IGNORECASE)
    result = re.sub(r"you will\s+", "", result, flags=re.IGNORECASE)

    # 3. Collapse redundant whitespace and empty lines
    result = re.sub(r'\n{3,}', '\n\n', result)
    result = re.sub(r'[ \t]+', ' ', result)
    result = result.strip()
    return result

# Before/after example:
before = """
Please make sure to respond in JSON format. You are required to include
the following fields: name, email, and status. It is important that you
validate the email format before including it. Please note that if the
email is invalid, you should set status to 'invalid'. Feel free to add
any additional context you think is helpful.
"""
after = compress_system_prompt(before)
print(f"Before: {count_tokens(before)} tokens")
print(f"After: {count_tokens(after)} tokens")
# Before: 74 tokens → After: 38 tokens (49% reduction)
```
If you’re also trying to get reliable structured output, compression and schema constraints work together — see our guide on getting consistent JSON from any LLM for the validation layer that catches regressions when your prompt changes.
Step 4: Strip redundant context dynamically
Conversation history is the sneaky token sink. Most agents pass the full message history on every turn. By turn 20, you’re paying for context that’s irrelevant to the current task.
```python
def smart_truncate_messages(
    messages: list[dict],
    max_tokens: int = 4000,
    keep_last_n: int = 6,
    preserve_tool_results: bool = True
) -> list[dict]:
    """
    Truncate message history intelligently:
    - Always keep the last N turns (recent context is highest signal)
    - Keep any tool call/result pairs (truncating these breaks agents)
    - Drop middle messages until under token budget
    """
    if count_message_tokens(messages) <= max_tokens:
        return messages  # Already fine, no truncation needed

    # Always preserve the tail (recent messages)
    tail = messages[-keep_last_n:] if len(messages) > keep_last_n else messages
    head_messages = messages[:-keep_last_n] if len(messages) > keep_last_n else []

    # From the head, keep tool results if preserve_tool_results=True
    # Tool results are identified by role='tool' in most implementations
    if preserve_tool_results:
        important_head = [
            m for m in head_messages
            if m.get("role") == "tool" or
               (isinstance(m.get("content"), list) and
                any(b.get("type") == "tool_result" for b in m.get("content", [])))
        ]
    else:
        important_head = []

    # Build final message list within budget
    candidate = important_head + tail
    # If still over budget, trim from the front of important_head
    while count_message_tokens(candidate) > max_tokens and len(important_head) > 0:
        important_head.pop(0)
        candidate = important_head + tail
    return candidate

# For high-volume agents: this alone can cut per-call costs by 30%
# on conversations that run longer than 10 turns
```
This pairs well with LLM caching strategies — if your truncated context is stable across calls, you can cache the prefix and pay a fraction of the normal input price for those repeated tokens.
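On the Anthropic API, caching is opt-in per content block via cache_control. A minimal sketch of marking a stable, compressed prefix as cacheable (the instruction strings are placeholders; verify current caching behavior and pricing in Anthropic's docs):

```python
STABLE_INSTRUCTIONS = "You are a support agent. Respond in JSON."   # compressed, rarely changes
per_request_context = "Customer tier: enterprise. Open tickets: 3."  # varies per call

# System prompt as content blocks: everything up to and including the
# block carrying cache_control is cached; later blocks are billed normally.
system_blocks = [
    {
        "type": "text",
        "text": STABLE_INSTRUCTIONS,
        "cache_control": {"type": "ephemeral"},
    },
    {"type": "text", "text": per_request_context},
]
```

Pass system_blocks as the system parameter on messages.create; keep the stable block byte-identical across calls or the cache misses.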
Step 5: Validate quality with automated scoring
Compression without measurement is just guessing. You need to know when your optimizations start degrading outputs. This is a lightweight eval loop — not a full benchmark suite, but enough to catch regressions.
```python
def run_quality_check(
    client: anthropic.Anthropic,
    original_system: str,
    compressed_system: str,
    test_cases: list[dict],  # [{"input": "...", "expected_fields": [...]}]
    model: str = "claude-haiku-4-5"  # Use Haiku for cheap eval runs
) -> dict:
    """
    Run the same test cases through original and compressed prompts.
    Compare: response length, field completeness, latency.
    Cost at Haiku pricing ($0.80/1M input): ~$0.002 per 10 test cases.
    """
    results = {"original": [], "compressed": [], "token_savings": 0}

    def field_score(text: str, expected_fields: list[str]) -> float:
        # Fraction of expected field names present in the output text
        if not expected_fields:
            return 0.0
        hits = sum(1 for field in expected_fields if field.lower() in text.lower())
        return hits / len(expected_fields)

    for case in test_cases:
        # Original prompt
        orig_response = client.messages.create(
            model=model,
            max_tokens=512,
            system=original_system,
            messages=[{"role": "user", "content": case["input"]}]
        )
        # Compressed prompt
        comp_response = client.messages.create(
            model=model,
            max_tokens=512,
            system=compressed_system,
            messages=[{"role": "user", "content": case["input"]}]
        )
        orig_text = orig_response.content[0].text
        comp_text = comp_response.content[0].text
        expected = case.get("expected_fields", [])
        results["original"].append(
            {"score": field_score(orig_text, expected),
             "tokens": orig_response.usage.input_tokens})
        results["compressed"].append(
            {"score": field_score(comp_text, expected),
             "tokens": comp_response.usage.input_tokens})

    avg_orig_score = sum(r["score"] for r in results["original"]) / len(results["original"])
    avg_comp_score = sum(r["score"] for r in results["compressed"]) / len(results["compressed"])
    avg_token_saving = (
        sum(r["tokens"] for r in results["original"]) -
        sum(r["tokens"] for r in results["compressed"])
    ) / len(test_cases)
    results["summary"] = {
        "original_avg_score": round(avg_orig_score, 3),
        "compressed_avg_score": round(avg_comp_score, 3),
        "quality_delta": round(avg_comp_score - avg_orig_score, 3),
        "avg_tokens_saved": round(avg_token_saving),
        "acceptable": avg_comp_score >= avg_orig_score * 0.95  # 5% tolerance
    }
    return results
```
Set your quality threshold before you start compressing, not after. I use 95% of original score as the floor. Below that, the compression is hurting you more than helping. For agents doing complex reasoning tasks, I’d tighten that to 98% — automated output quality evaluation goes deeper on setting appropriate metrics per task type.
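Those per-task floors can be encoded as a small gate in front of your deploy step. The task names and exact floors below are illustrative; pick your own from the eval results:

```python
# Hypothetical per-task quality floors: fraction of the original prompt's
# score that the compressed prompt must retain
QUALITY_FLOORS = {
    "extraction": 0.95,
    "classification": 0.95,
    "complex_reasoning": 0.98,  # tighter floor for reasoning-heavy agents
}

def compression_is_safe(task_type: str, original_score: float,
                        compressed_score: float) -> bool:
    """Gate a compressed prompt: pass only if it retains at least the
    task's floor fraction of the original prompt's eval score."""
    floor = QUALITY_FLOORS.get(task_type, 0.95)  # default to the 5% tolerance
    return compressed_score >= original_score * floor
```

Wire this into CI so a prompt change that drops below the floor fails the build instead of shipping.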
Step 6: Wire everything into a production wrapper
```python
class TokenOptimizedClient:
    """
    Drop-in wrapper around the Anthropic client.
    Compresses prompts, truncates history, and tracks savings.
    """
    def __init__(self, api_key: str, max_context_tokens: int = 6000):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.max_context_tokens = max_context_tokens
        self.total_tokens_saved = 0

    def create(
        self,
        model: str,
        system: str,
        messages: list[dict],
        max_tokens: int = 1024,
        compress: bool = True,
        **kwargs
    ) -> anthropic.types.Message:
        original_token_count = count_tokens(system) + count_message_tokens(messages)
        if compress:
            system = compress_system_prompt(system)
            messages = smart_truncate_messages(
                messages,
                max_tokens=self.max_context_tokens
            )
        optimized_token_count = count_tokens(system) + count_message_tokens(messages)
        self.total_tokens_saved += (original_token_count - optimized_token_count)
        return self.client.messages.create(
            model=model,
            system=system,
            messages=messages,
            max_tokens=max_tokens,
            **kwargs
        )

    def savings_report(self) -> dict:
        # Sonnet 3.5 pricing for reference
        cost_saved = (self.total_tokens_saved / 1_000_000) * 3.0
        return {
            "tokens_saved": self.total_tokens_saved,
            "estimated_cost_saved_usd": round(cost_saved, 4)
        }

# Usage
client = TokenOptimizedClient(api_key=os.environ["ANTHROPIC_API_KEY"])
response = client.create(
    model="claude-sonnet-4-5",
    system="Your system prompt here...",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=512
)
print(client.savings_report())
# {"tokens_saved": 14823, "estimated_cost_saved_usd": 0.0445} after 100 calls
```
Common errors
Over-compression breaking structured output
If your prompt instructs the model to return JSON with specific fields, stripping “it is important that” from “it is important that the JSON includes the ‘status’ field” can cause the model to occasionally drop that field. Fix: Keep field-level constraints verbatim. Only compress framing language around them. Run your quality check after each compression pass, not just at the end.
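One way to enforce this is to treat quoted or backticked spans as protected and compress only the text between them. A rough sketch (the single filler pattern stands in for the full substitution table, and the quote regex will also trip on apostrophes in prose, so treat this as a starting point rather than a finished pass):

```python
import re

# Spans we never touch: double-quoted, single-quoted, or backticked text,
# which is where field names and literal constraints usually live
PROTECT = re.compile(r'"[^"]+"|\'[^\']+\'|`[^`]+`')

def compress_preserving_constraints(prompt: str) -> str:
    """Split the prompt on protected spans and apply filler stripping
    only to the unprotected text between them."""
    parts = PROTECT.split(prompt)          # text between protected spans
    protected = PROTECT.findall(prompt)    # the spans themselves, untouched
    cleaned = [re.sub(r"\bit is important that\s+", "", p, flags=re.IGNORECASE)
               for p in parts]
    # Re-interleave cleaned text with the untouched protected spans
    out = []
    for i, piece in enumerate(cleaned):
        out.append(piece)
        if i < len(protected):
            out.append(protected[i])
    return "".join(out)
```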
Truncating tool results mid-sequence
The smart_truncate_messages function preserves tool results by default, but if you set preserve_tool_results=False and your agent has pending tool calls, you’ll get API errors or silently wrong behavior. Fix: Never truncate a tool_use message without also removing its corresponding tool_result. Always remove pairs, not individual messages.
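A pair-aware removal helper might look like this. It assumes the Anthropic tool-use block shapes: tool_use blocks carry an id, tool_result blocks carry a matching tool_use_id:

```python
def drop_tool_pair(messages: list[dict], tool_use_id: str) -> list[dict]:
    """Remove both halves of one tool exchange: the assistant message
    holding the tool_use block with this id, and the user message holding
    the matching tool_result. Dropping only one half makes the API reject
    the conversation."""
    def references(msg: dict) -> bool:
        content = msg.get("content")
        if not isinstance(content, list):
            return False  # plain-text message, never part of a tool pair
        return any(
            block.get("id") == tool_use_id or block.get("tool_use_id") == tool_use_id
            for block in content
        )
    return [m for m in messages if not references(m)]
```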
tiktoken count diverging from actual billing
tiktoken’s cl100k_base underestimates Claude’s actual token count by 1–5% on prompts with lots of special characters, code blocks, or non-English text. Fix: For cost forecasting, multiply your tiktoken count by 1.05 to add a safety buffer. For production billing tracking, pull actual counts from response.usage.input_tokens — that’s the number you’re actually paying for.
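For the forecasting side, a tiny helper keeps the padding consistent everywhere. Integer math avoids float-rounding surprises at the boundary:

```python
def padded_token_estimate(offline_count: int) -> int:
    """Pessimistic pre-flight estimate: pad tiktoken's offline count by 5%
    so cost forecasts err on the high side. Integer ceiling of count * 1.05."""
    return (offline_count * 105 + 99) // 100
```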
What to build next
Add semantic deduplication to your context window. Instead of just keeping the last N messages, embed each message and use cosine similarity to drop messages that are semantically redundant to more recent ones. A user saying “make it shorter” on turn 3 is irrelevant if they said “actually make it longer” on turn 15. This is a meaningful extension of the truncation logic above and pairs naturally with hybrid search approaches if you’re already embedding content for retrieval.
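A sketch of that dedup pass, using a bag-of-words stand-in where a real embedding model would go (swap embed for an actual embeddings call in production; the 0.9 threshold is a starting guess to tune against your own history):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words vector. Replace with a real
    embedding model in production."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_history(messages: list[dict], threshold: float = 0.9) -> list[dict]:
    """Walk history newest-first; drop any older message that is a
    near-duplicate (cosine >= threshold) of a message already kept."""
    kept, kept_vecs = [], []
    for msg in reversed(messages):
        text = msg.get("content", "")
        if not isinstance(text, str):
            kept.append(msg)       # keep structured content untouched
            kept_vecs.append(None)
            continue
        vec = embed(text)
        if any(v is not None and cosine(vec, v) >= threshold for v in kept_vecs):
            continue  # semantically redundant to a newer message
        kept.append(msg)
        kept_vecs.append(vec)
    return list(reversed(kept))
```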
If you’re running this inside a larger agent system and need to handle the infrastructure side — rate limits, retries, cost budgets — the patterns here slot directly into a serverless deployment. The TokenOptimizedClient class is stateless by design, which makes it trivial to deploy as a Lambda or Cloud Run function.
Bottom line by reader type: If you’re a solo founder running a few hundred calls per day, step 3 alone (structural compression) will recover your costs. If you’re running thousands of calls daily, implement all six steps and add the semantic deduplication extension — at that volume, prompt token costs compound fast enough to meaningfully affect your margin. Teams running multi-agent systems should also look at whether the caching and batching strategies can be layered on top of what’s built here for an additional 20–30% reduction.
Frequently Asked Questions
How much can I realistically reduce my Claude API token costs without hurting quality?
40–60% is achievable for most production prompts with the techniques in this tutorial. The biggest gains usually come from system prompt compression (10–30%) and conversation history truncation (15–35%). Quality degradation is rare if you keep your compression to framing language and avoid touching semantic instructions — always validate with the quality check before deploying.
Does prompt compression work the same way for Claude Haiku vs Sonnet vs Opus?
The token counting and compression techniques are model-agnostic — the same bytes cost the same tokens regardless of which Claude model you use. The tradeoff is different per model: Haiku is cheap enough that a 40% token reduction saves less in absolute dollars, but it matters more because Haiku is often used at 10–100x the volume. For Opus, even a 20% reduction is significant given its $15/1M input price.
Will removing filler phrases from prompts change how Claude behaves?
Polite framing words (“please”, “kindly”, “I would like you to”) have negligible effect on Claude’s actual outputs — they’re processed as tokens but don’t carry meaningful instructional weight. What you must not remove are specific behavioral constraints, output format specifications, and domain-specific terminology. The audit function in Step 2 targets only the former.
Is tiktoken accurate enough for cost estimation with Claude?
Close enough for planning — expect 1–5% variance from actual billing, higher on non-English text and code. For accurate per-call tracking in production, always read response.usage.input_tokens from the API response and log it. Use tiktoken only for pre-flight checks before sending the request, where you want to avoid the latency of an API round-trip just to count tokens.
How do I avoid breaking agents that rely on long conversation history?
Use the keep_last_n parameter to guarantee recent context is always preserved, and set preserve_tool_results=True to protect tool call/result pairs. For agents where memory is critical, consider summarizing older turns rather than dropping them — pass the old messages through a cheap Haiku call to generate a 100-token summary, then replace 20 messages with that summary.
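A sketch of that summarize-and-fold step. Here summarize is any str-to-str callable; in production you would wrap a cheap Haiku call, and injecting it as a parameter keeps the folding logic independent of the API client:

```python
def fold_history(messages: list[dict], summarize, keep_last_n: int = 6) -> list[dict]:
    """Replace everything older than the last N messages with a single
    summary turn produced by `summarize` (e.g. a Haiku call)."""
    if len(messages) <= keep_last_n:
        return messages
    old, recent = messages[:-keep_last_n], messages[-keep_last_n:]
    # Flatten the old turns into a plain transcript for the summarizer
    transcript = "\n".join(
        f"{m['role']}: {m['content']}" for m in old
        if isinstance(m.get("content"), str)
    )
    summary_turn = {
        "role": "user",
        "content": f"[Summary of earlier conversation]\n{summarize(transcript)}",
    }
    return [summary_turn] + recent
```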
Can I apply these techniques to GPT-4o or other OpenAI models?
Yes — the structural compression and context truncation logic is model-agnostic. One correction when porting: GPT-4o uses the newer o200k_base tokenizer, not cl100k_base, so swap the encoding name in tiktoken; your offline counts will then be exact rather than approximate, since tiktoken is OpenAI's own tokenizer. Swap the anthropic client for the openai client and the wrapper class works unchanged. Pricing is different (roughly $2.50/1M input for GPT-4o vs $0.80/1M for Claude Haiku) but the optimization logic is identical.
Put this into practice
Try the Prompt Engineer agent — ready to use, no setup required.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

