Sunday, April 5

Most contract review tooling falls into two camps: expensive legal SaaS that wraps a model you can’t control, or toy demos that extract a few clauses and call it done. If you’re building a contract review agent for real workflows — law firms, ops teams, or your own product — you need something in between: a system that handles messy PDFs, understands context across long documents, flags actual risks, and produces reports a non-technical stakeholder can act on. That’s what this walkthrough builds.

We’ll use Claude’s API directly (Anthropic’s claude-3-5-sonnet-20241022 model is the sweet spot here), Python for orchestration, and PyMuPDF for PDF extraction. By the end you’ll have a working agent pipeline: upload a contract, extract structured terms, run risk analysis, and generate a markdown/HTML report. Estimated API cost is around $0.10–$0.25 per contract at current Sonnet pricing — each contract triggers three full-document calls — depending on document length.

Why Claude for Contract Analysis (and Where It Struggles)

Claude handles long-form legal text better than most models out of the box. Its 200K context window means you can pass an entire 80-page MSA without chunking. GPT-4o is comparable but Claude tends to be more conservative in its risk flagging — which is what you want in legal contexts. It’s less likely to hallucinate a clause that doesn’t exist.

That said, be honest about the failure modes before you ship anything:

  • Scanned PDFs need OCR first — if the document is an image-based scan with no text layer, extraction returns garbage. Add a pre-processing step (Tesseract, AWS Textract, or Adobe PDF Services).
  • Very long contracts with dense cross-references can confuse clause attribution. “As defined in Section 14.2(b)” five levels deep is hard to resolve correctly.
  • Jurisdiction-specific nuance — the model knows a lot about US/UK contract law but will make mistakes on edge cases. Always position output as a first-pass review, not legal advice.
  • Tables and schedules often extract poorly from PDFs. Payment schedules, SLA matrices, and pricing annexes need special handling.

Project Architecture: What We’re Actually Building

The pipeline has four stages that run sequentially:

  1. Ingestion — Extract text from PDF, clean whitespace, detect page structure
  2. Extraction — Pull structured data (parties, dates, payment terms, IP clauses, termination rights)
  3. Risk Analysis — Flag problematic clauses against a configurable ruleset
  4. Report Generation — Produce a human-readable summary with risk ratings

Each stage calls Claude independently with a focused prompt. This is intentional — chaining everything into one mega-prompt sounds cleaner but performs worse on long documents. Smaller, targeted prompts with specific output formats are more reliable and easier to debug.

Setting Up Dependencies

pip install anthropic pymupdf python-dotenv jinja2

import anthropic
import fitz  # PyMuPDF
import json
import os
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
MODEL = "claude-3-5-sonnet-20241022"

Stage 1: PDF Extraction That Actually Works

PyMuPDF is faster and more reliable than pdfplumber for most contracts. The key is preserving some structure while stripping the noise that breaks LLM parsing — excessive whitespace, page headers/footers, and watermarks.

def extract_contract_text(pdf_path: str) -> dict:
    """
    Extract text from PDF with basic structure preservation.
    Returns dict with full_text and page_count.
    """
    doc = fitz.open(pdf_path)
    pages = []
    
    for page_num, page in enumerate(doc, 1):
        text = page.get_text("text")
        
        # Strip common header/footer patterns (customize per client)
        lines = text.split('\n')
        cleaned_lines = [
            line for line in lines
            if len(line.strip()) > 2  # remove single-char artifacts
            and not line.strip().isdigit()  # remove standalone page numbers
        ]
        pages.append({
            "page": page_num,
            "text": '\n'.join(cleaned_lines)
        })
    
    full_text = '\n\n'.join([f"[Page {p['page']}]\n{p['text']}" for p in pages])
    
    return {
        "full_text": full_text,
        "page_count": len(doc),
        "char_count": len(full_text)
    }

For a 30-page NDA, this typically produces around 15,000–25,000 tokens when sent to Claude. Well within Sonnet’s window. For 100+ page agreements, consider extracting by section header rather than passing the whole document — but that’s a chunking strategy for another article.

Stage 2: Structured Term Extraction

This is where most implementations go wrong. They ask Claude to “extract all important clauses” and get back free-form text that’s hard to process downstream. Force structured JSON output with a tight schema.

# Literal braces in the schema are doubled so str.format() only fills {contract_text}
EXTRACTION_PROMPT = """You are a contract analyst. Extract the following information from this contract and return ONLY valid JSON — no prose, no markdown fences.

Schema:
{{
  "parties": [{{"name": string, "role": string}}],
  "effective_date": string | null,
  "expiry_date": string | null,
  "auto_renewal": boolean,
  "renewal_notice_days": integer | null,
  "governing_law": string | null,
  "payment_terms": {{
    "amount": string | null,
    "currency": string | null,
    "schedule": string | null,
    "late_payment_penalty": string | null
  }},
  "termination": {{
    "for_convenience": boolean,
    "notice_period_days": integer | null,
    "for_cause": boolean
  }},
  "ip_ownership": string | null,
  "limitation_of_liability": string | null,
  "non_compete": boolean,
  "non_solicitation": boolean,
  "data_processing": boolean,
  "contract_type": string
}}

Contract text:
{contract_text}"""

def extract_terms(contract_text: str) -> dict:
    response = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": EXTRACTION_PROMPT.format(contract_text=contract_text)
        }]
    )
    
    raw = response.content[0].text.strip()
    
    # Strip markdown fences if the model includes them anyway (it sometimes does)
    if raw.startswith("```"):
        raw = raw.split('\n', 1)[1].rsplit('```', 1)[0]
    
    return json.loads(raw)

The JSON stripping fallback isn’t optional — even with explicit instructions, models occasionally wrap output in fences. Handle it gracefully rather than crashing. I’d recommend wrapping the json.loads call in a try/except that logs the raw output if parsing fails, so you can debug prompt drift over time.
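That try/except wrapper might look like the sketch below — the fence stripping mirrors the logic in `extract_terms`, and the logger name is a hypothetical choice:

```python
import json
import logging

logger = logging.getLogger("contract_review")  # hypothetical logger name

def parse_model_json(raw: str):
    """Strip optional markdown fences, then parse; log raw output on failure."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split('\n', 1)[1].rsplit('```', 1)[0]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Keep the first 500 chars of raw output so you can diagnose prompt drift
        logger.error("Model returned unparseable JSON: %r", raw[:500])
        raise
```

Both `extract_terms` and `analyze_risks` can delegate to this helper instead of duplicating the fence-stripping logic inline.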

Stage 3: Risk Analysis Against a Configurable Ruleset

This is the part that separates a useful contract review agent from a fancy clause extractor. You want to flag actual problems: uncapped liability, missing IP assignment, auto-renewal traps, one-sided termination rights.

Defining Your Risk Rules

RISK_RULES = [
    "Unlimited or uncapped liability exposure for one party",
    "Auto-renewal clauses with short notice windows (under 30 days)",
    "IP ownership assigned to the client/vendor without clear carve-outs for pre-existing IP",
    "Unilateral termination rights without mutual equivalence",
    "Non-compete clauses exceeding 12 months or covering unreasonably broad geography",
    "Missing data processing / GDPR provisions where personal data is involved",
    "Indemnification clauses that are heavily one-sided",
    "Payment terms with no late payment remedy for the payee",
    "Liquidated damages clauses that may be unenforceable penalties",
    "Evergreen agreements with no sunset clause"
]

RISK_PROMPT = """You are a senior contract lawyer conducting a risk review. 
Analyze the contract below against these specific risk categories:

{rules}

For each risk found, return a JSON array of objects:
[{{
  "risk_category": string,
  "severity": "high" | "medium" | "low",
  "clause_reference": string,  // quote the relevant text (max 100 chars)
  "explanation": string,        // 1-2 sentences explaining why this is a risk
  "recommendation": string      // specific action to address this
}}]

If no risks found for a category, omit it. Return ONLY the JSON array.

Contract:
{contract_text}"""

def analyze_risks(contract_text: str) -> list:
    rules_formatted = '\n'.join(f"- {r}" for r in RISK_RULES)
    
    response = client.messages.create(
        model=MODEL,
        max_tokens=3000,
        messages=[{
            "role": "user",
            "content": RISK_PROMPT.format(
                rules=rules_formatted,
                contract_text=contract_text
            )
        }]
    )
    
    raw = response.content[0].text.strip()
    if raw.startswith("```"):
        raw = raw.split('\n', 1)[1].rsplit('```', 1)[0]
    
    return json.loads(raw)

Customize RISK_RULES per client type. A SaaS vendor has different risk priorities than a freelancer reviewing client agreements. Making this configurable at runtime (load from a JSON config per client) is one of the highest-leverage things you can do to make this production-ready.

Stage 4: Generating the Report

REPORT_PROMPT = """Based on the extracted contract data and risk analysis below, 
write a professional contract review report in markdown format.

Structure:
## Contract Overview
## Key Terms Summary  
## Risk Assessment (use a table: Risk | Severity | Recommendation)
## Priority Actions
## Reviewer Notes

Keep the tone professional but accessible to non-lawyers. Be specific — quote clause text where relevant.

Extracted Terms:
{terms}

Identified Risks:
{risks}

Contract Type: {contract_type}"""

def generate_report(terms: dict, risks: list) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=2500,
        messages=[{
            "role": "user",
            "content": REPORT_PROMPT.format(
                terms=json.dumps(terms, indent=2),
                risks=json.dumps(risks, indent=2),
                contract_type=terms.get("contract_type", "Unknown")
            )
        }]
    )
    return response.content[0].text
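This is also where the jinja2 dependency from the install step earns its keep: the HTML side of the report doesn't need another model call. A sketch that renders the risk list into a standalone HTML table — the inline template is for brevity; in practice you'd keep it in a file:

```python
from jinja2 import Template

# Renders the risk dicts produced by analyze_risks into an HTML table
RISK_TABLE_TEMPLATE = Template("""<html><body>
<h1>Contract Risk Summary</h1>
<table border="1">
  <tr><th>Risk</th><th>Severity</th><th>Recommendation</th></tr>
  {% for r in risks %}
  <tr>
    <td>{{ r.risk_category }}</td>
    <td>{{ r.severity }}</td>
    <td>{{ r.recommendation }}</td>
  </tr>
  {% endfor %}
</table>
</body></html>""")

def render_html_report(risks: list) -> str:
    return RISK_TABLE_TEMPLATE.render(risks=risks)
```

Jinja2 resolves `r.risk_category` against dict keys as well as attributes, so the risk dicts from Stage 3 drop straight in.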

Wiring It All Together

def run_contract_review(pdf_path: str, output_dir: str = "reports") -> dict:
    Path(output_dir).mkdir(exist_ok=True)
    contract_name = Path(pdf_path).stem
    
    print(f"Extracting text from {pdf_path}...")
    doc = extract_contract_text(pdf_path)
    
    print(f"Extracted {doc['char_count']:,} chars across {doc['page_count']} pages")
    
    print("Extracting structured terms...")
    terms = extract_terms(doc["full_text"])
    
    print("Running risk analysis...")
    risks = analyze_risks(doc["full_text"])
    
    high_risks = [r for r in risks if r.get("severity") == "high"]
    print(f"Found {len(risks)} risks ({len(high_risks)} high severity)")
    
    print("Generating report...")
    report_md = generate_report(terms, risks)
    
    # Save outputs
    report_path = f"{output_dir}/{contract_name}_review.md"
    with open(report_path, "w", encoding="utf-8") as f:
        f.write(report_md)
    
    terms_path = f"{output_dir}/{contract_name}_terms.json"
    with open(terms_path, "w", encoding="utf-8") as f:
        json.dump({"terms": terms, "risks": risks}, f, indent=2)
    
    return {
        "report": report_path,
        "terms": terms_path,
        "risk_count": len(risks),
        "high_risks": len(high_risks)
    }

# Run it
if __name__ == "__main__":
    result = run_contract_review("contracts/vendor_msa.pdf")
    print(f"Done. Report: {result['report']}")
    print(f"High severity risks: {result['high_risks']}")

Three Claude API calls per contract, each carrying the full document text. For a 20-page MSA at current Sonnet pricing (~$3/M input tokens, $15/M output), you’re looking at roughly $0.10–$0.25 per document. That’s still well within range for any commercial use case — even at 500 contracts/month you’re around $100 in API costs.
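To sanity-check that estimate, here's the arithmetic as a sketch. Every default — prompt overhead, total output tokens, and the per-million-token rates — is an assumption; verify current pricing before relying on this:

```python
def estimate_review_cost(doc_tokens: int, calls: int = 3,
                         prompt_overhead: int = 1000, output_tokens: int = 4000,
                         in_rate: float = 3.0, out_rate: float = 15.0) -> float:
    """Rough USD cost: `calls` full-document inputs plus total output tokens,
    priced at per-million-token rates."""
    input_cost = calls * (doc_tokens + prompt_overhead) / 1e6 * in_rate
    output_cost = output_tokens / 1e6 * out_rate
    return round(input_cost + output_cost, 3)
```

A ~15K-token contract works out to about twenty cents per review under these assumptions; the three full-document input passes dominate the bill, which is also the first place to look if you later add prompt caching or chunking.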

Production Hardening: What You Need Before Going Live

The code above works. What it doesn’t handle yet:

  • Retry logic — wrap each Claude call with exponential backoff using tenacity. API timeouts on long documents happen.
  • Token limit enforcement — contracts over ~150K tokens need a chunking strategy. Count tokens with the client.messages.count_tokens() endpoint before sending.
  • Async processing — for web apps, run the review pipeline as a background job (Celery, RQ, or a simple queue). Don’t block an HTTP request on a 30-second LLM chain.
  • Prompt versioning — store your prompt templates in a config file with version numbers. When you update a prompt, you want to know which version produced which report.
  • Human review flag — any contract with 3+ high-severity risks should route to a human. Never make this fully automated for anything with real legal consequences.
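With tenacity the retry bullet is a one-line decorator; if you'd rather not add the dependency, a hand-rolled stdlib equivalent is a few lines. The `Exception` default below is deliberately broad for illustration — in practice you'd pass the SDK's error types (e.g. anthropic.APIStatusError):

```python
import functools
import random
import time

def with_backoff(max_attempts: int = 4, base_delay: float = 1.0,
                 retry_on: type[Exception] = Exception):
    """Retry a function with exponential backoff plus jitter."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts, surface the error
                    time.sleep(base_delay * (2 ** attempt + random.random()))
        return wrapper
    return decorator
```

Decorate `extract_terms`, `analyze_risks`, and `generate_report` with it; keep `max_attempts` low so a genuinely malformed request fails fast instead of burning four identical calls.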

When to Use This vs. Buy an Off-the-Shelf Tool

Build this yourself if: you need custom risk rules per client type, you’re embedding contract review into a larger product, or you’re processing contracts programmatically in a pipeline (post-signature compliance checks, vendor onboarding automation).

Use something like Ironclad, Spellbook, or Docusign AI if: you’re a law firm that needs a polished UI, audit trails, and support SLAs. The build-vs-buy math only favors building when you have engineering resources and genuinely custom requirements.

For solo founders and small ops teams, this contract review agent gives you 80% of what the expensive tools provide at 5% of the cost — with full control over the risk logic that matters for your specific use case. That’s the practical sweet spot this approach was designed for.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
